Getting Started with a Lexical Analyzer
In earlier computer generations, running even simple code was much harder than one might think. This was because computers did everything in machine language, often written on punch cards. Programming then moved to mnemonics, as used in Assembly language. <!--more--> These were simpler, but still required an understanding of the machine. Since computers proved effective at repetitive tasks, developers needed high-level languages more than ever.
The invention of the high-level language compiler was a significant breakthrough in the growth of computing. This was because high-level languages are understandable by most people, even though translating them adds considerable overhead.
Hence, to appreciate compilers and their work in converting source code into optimized object code, one has to understand how they are designed.
Compiler design involves several phases. Such phases include the following:
- Lexical analysis
- Syntax analysis
- Semantic analysis
- Intermediate code generation
- Code optimization
- Target code generation
- Error handling
- Symbol table generation
A lexical analyzer is used in the first phase of compiler construction.
Table of contents
- Key takeaways
- Pre-requisites
- What is Lexical analysis?
- Fundamental terms
- A token count based on a sample code
- Lexical analysis tools
- Set up the tools
- Architecture of Lex files
- Create a Lex file and run it
- A Lex analyzer that converts instances of certain characters into a particular pattern
- A lex analyzer that identifies certain words
- Conclusion
Key takeaways
In this article, one will learn the following:
- What is Lexical analysis
- Lexical analysis: core principles
- Importance of Lexical analysis
- Application of the principles in creating a lexical analyzer
- Building a lexical analyzer with code examples
Pre-requisites
To follow along with this article, the following are required:
- A good internet connection
- Know how to use Regular Expressions
- Knowledge of C programming
- A good IDE or a text editor of one's choice
Let us get into the article now:
What is Lexical analysis?
Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner. It is the process of converting high-level source code into a series of tokens that the compiler can easily recognize.
These tokens are then passed through a series of steps that check whether they are in the correct format, including checks of their syntax and semantics. The input to a lexical analyzer is thus the programmer's high-level source code, which it splits into smaller chunks that can be quickly and easily analyzed and worked on.
NOTE: Lexical analyzers do not change code or check for errors.
Fundamental terms
Some terms that are used include:
- Tokens
- Patterns
- Lexeme
Let us take a look at what they are:
Tokens
Tokens are defined as anything that can be recognized in a Context-Free Grammar (CFG). A token represents a unit of information. A CFG is described by its terminals, non-terminals, production rules, and Start symbol (T, N, P, S).
- Terminals (T) are symbols that represent final values and cannot be replaced further.
- Non-terminals (N) can be replaced with either terminals or non-terminals.
- Production rules (P) are combinations of terminals and non-terminals that can be combined in a format that makes sense and represents certain operations.
- The Start symbol (S), on the other hand, is a special non-terminal symbol that marks the starting point of the grammar, as illustrated in the toy grammar below.
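As a hedged illustration (this grammar is not from the article; the symbol names are made up), consider a toy grammar where S is the Start symbol, E and T are non-terminals, and 'id', 'num', and '+' are terminals:
S -> E /* S: the Start symbol */
E -> E '+' T | T /* production rules combine terminals and non-terminals */
T -> 'id' | 'num' /* 'id', 'num', and '+' are terminals */
Here, a lexical analyzer would supply the terminal tokens ('id', 'num', '+'), and the parser would apply the production rules starting from S.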
Patterns
Patterns are specific formats used to recognize tokens. Compilers can use them to identify letters, words, and special characters. These can then be classified as terminals or non-terminals, among others.
Patterns can be user-defined or come pre-defined with the program beforehand.
Lexemes
On the other hand, a lexeme is a sequence of characters in the source code that matches the pattern of a token and conveys a meaningful unit when taken as a whole.
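For example (an illustrative sketch; the classification names are ours), the C statement below contains five lexemes, each matched by a token pattern:
int count = 10; /* lexemes: 'int', 'count', '=', '10', ';' */
/* tokens: keyword, identifier, operator, integer literal, separator */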
A token count based on a sample code
The Java code below prints a 'Hello World!' message when run:
/*Print 'Hello World!'*/
/*Comments and whitespaces are not counted*/
public class HelloWorld
{
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
The number of tokens recognized by a lexical analyzer in this code is 27.
Check out the image below to see how they are counted:
The lexical analyzers recognize tokens such as:
- Keywords
- Special characters
- Macros and special directives
- Strings, integers, floats, identifiers, among others.
These tokens can then be passed to a syntax analyzer, which generates parse trees based on the symbol table to check whether they are correct.
Non-tokens include spaces, tabs, and comments, among others.
Lexical analysis tools
Some tools used to create and use lexical analyzers include:
- LEX tool
- YACC
- A C compiler
LEX tool
It is also known as Flex (a GNU tool). It generates C source code for the scanner.
It can be integrated with any application or can work as a stand-alone application.
It first sets some configurations that define the tokens expected from an input and what actions to take before parsing (sending them to a syntactical analyzer).
YACC (Yet Another Compiler Compiler)
A standard UNIX utility that compiles LALR(1) grammars using the Look-Ahead Left-to-Right parsing style. It is an open-source tool that generates the parser's code in the C language.
C compiler
This is a typical C language compiler. GNU CC will work out perfectly.
Set up the tools
This section will feature the installation of the lexical analysis tools discussed above.
Linux Installation
To do this on Ubuntu, run the following commands:
- Get current software updates for installed apps
sudo apt-get update
- Install Flex GNU tools
sudo apt-get install flex
- Install Bison as follows:
sudo apt-get install bison
Bison is a general-purpose YACC-compatible parser generator that converts a grammar description for an LALR(1) context-free grammar into a C program to parse that grammar. One may use it to develop a range of language parsers. Bison has upward compatibility with Yacc.
- Install Yacc
sudo apt-get install byacc
- One can also install the Bison++ package to generate a parser in C or C++ from a BNF notation. One can do it by running the command below:
sudo apt-get install bison++
Windows Installation
To install Flex and Bison for Yacc, do the following:
- Head over to this site. All GNU utilities such as Bison and Flex can be found there.
- Download Bison Setup found on the same site or by clicking here.
- Install it by running it.
- Download Flex setup by clicking here. Install it after the downloading process is complete.
Add the binaries to the PATH variable by doing the following:
- Open the folder where the program files for Bison and Flex are installed. By default, this is the C:\Program Files (x86)\GnuWin32 folder.
- Copy its location path.
- Search in the Windows taskbar for Environment Variables and select the 'Edit the System Environment Variables' option.
- Select the 'Environment Variables' option.
- In the 'Path' option, paste the location of the bin folder. The folder path to be pasted is C:\Program Files (x86)\GnuWin32\bin.
- Save it and close the windows.
Architecture of Lex files
Lex files have a .l extension.
However, before creating a Lex file, let us first know the parts of a Lex file.
Lex file parts
Lex files have three main parts. These parts are the following:
- Definition
- Rules
- Subroutines
Definition
It contains literal blocks, definitions, internal table declarations, start conditions, and translations.
It can contain things like the 'include' section found in a C program. This section declares the headers and other items the program may need.
Definitions are written between the %{ and %} signs.
Rules and actions
This section carries patterns and C code (actions). The patterns are written as regular expressions.
It specifies the tokens the program checks, and the C code states what happens when they are found in the Actions section.
Actions are placed inside curly brackets ({ ... }). The actions or macros are optional; hence, they can be left blank.
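As a minimal illustrative sketch (not one of the article's later examples), the Lex file below pairs one pattern with one action: it prints every integer found in the input and ignores everything else:
%{
/* Minimal illustrative Lex file */
#include <stdio.h>
%}
%%
[0-9]+ { printf("Found a number: %s\n", yytext); } /* pattern + action */
.|\n { ; /* ignore everything else */ }
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }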
Subroutines
This section contains C code defining additional user-defined functions to execute when the patterns are found or not found. It can also contain other functions that the program may execute on the run.
These are optional too.
The three sections are divided by delimiters (%%), as shown below:
%{
/* Definitions */
%}
%%
/* Rules and actions */
%%
/* User subroutines */
y.tab.c is the output file generated by YACC. Compiling it with a C language compiler produces the yyparse function.
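For instance, assuming a hypothetical grammar file named parser.y, the commands below would generate and compile y.tab.c (a sketch; the exact flags vary by setup):
yacc -d parser.y # generates y.tab.c (and y.tab.h, because of -d)
cc y.tab.c -o parser # compiling y.tab.c yields the yyparse function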
The image below shows a brief explanation of the code structure based on the parts defined and explained above:
Create a Lex file and run it
Let us create a Lex analyzer that counts the number of words in a file and their total size.
Create a root folder named 'flex-projects'. In it, create a file named counter.l. Open it and add the lines of code that state the program declarations as follows:
/* Definitions */
%{
int lines = 1,
words = 0,
lowercase_letters = 0,
uppercase_letters = 0,
numbers = 0,
special_characters = 0,
totals = 0,
size = 0;
%}
/* Delimiters: These separate the definitions from the Rules and actions section */
%%
Add the rules and actions. Uppercase and lowercase letters, digits, special characters, words, and line instances will all be counted. Check it out in the code below:
\n { lines++; words++;} /* Adds to the lines and words variables a value of 1 when one goes to a new line */
[ \t] words++; /* Adds a value to the words variable when one moves from one word to another (words are separated by a space or tab) */
[A-Z] uppercase_letters++;
[a-z] lowercase_letters++;
[0-9] numbers++;
[$&+,:;=?@#|'<>.^*()%!-] special_characters++;
/* Separate the Rules and Action Section from the User sub-routines */
%%
In the User sub-routines section, add the main program that prints the totals.
int yywrap() {
return 1; /* 1: no more input files to scan */
}
int main() {
yyin = fopen("test.txt", "r"); /* Open the 'test.txt' file in read-only format */
yylex();
totals = special_characters + numbers + uppercase_letters + lowercase_letters;
size = (totals * 1); /* The total size in bytes equals the total number of characters, since each character is one byte */
/* Prints the output */
printf("This file contains the following:");
printf("\n\t%d lines", lines);
printf("\n\t%d words", words);
printf("\n\t%d Lowercase letters", lowercase_letters);
printf("\n\t%d Uppercase letters", uppercase_letters);
printf("\n\t%d digits", numbers);
printf("\n\t%d special characters", special_characters);
printf("\n\t The total size of the file characters in bytes is: %d bytes.\n", size);
}
Create a new file named 'test.txt' and open it. Add the following text to it:
Hello Mark and Jane! Did you go for lunch?
I am coming back tomorrow. Don't go home yet...
I will visit you at 0945hrs on 27th November this year
Open the folder where the file resides in the terminal or command line (on Windows). Convert the Lex file into C source code using the command below:
flex counter.l
Convert the C source code into an executable file by running the command below in the terminal:
gcc lex.yy.c
Run the executable. This runs the analyzer created so that it reads the input file and generates the output on the terminal:
./a.out
This is how the process looks:
The output on the terminal is as shown below:
YACC comes with predefined reserved words, also known as keywords. Among these are yytext, yylex, and yywrap, to name a few.
As one can notice, they start with yy. This is because YACC uses that prefix for its keywords by default.
One can also set the program to use another two-character prefix. The characters can be lowercase, or one uppercase and the other lowercase. Both cannot be uppercase, since all-uppercase combinations are reserved for other keywords.
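With Flex, for example, the prefix can be changed in the definitions section using the %option prefix directive (a sketch; the exact set of renamed symbols depends on the Flex version):
%option prefix="zz" /* yylex, yytext, yyin, ... become zzlex, zztext, zzin, ... */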
yylex() is a function that passes the tokens found to yyparse() each time it is called. It returns 0 or a negative value when the end of the file (EOF) is reached. It utilizes a routine called yygetc() to fetch these values.
yyparse() calls upon yylex() to obtain the tokens passed to the program. It then matches them against the defined patterns to determine the kind of action to be performed.
yywrap() is called when the end of the input is reached. It returns a value of 1 if the program's run is over and a value of 0 if there is more input to process.
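The sketch below (an assumption-laden example; the second file name is hypothetical) shows how yywrap() can keep the run going by pointing yyin at another file and returning 0:
int yywrap() {
    static int switched = 0;
    if (!switched) {
        switched = 1;
        yyin = fopen("second.txt", "r"); /* hypothetical next input file */
        if (yyin != NULL)
            return 0; /* 0: more input remains, so scanning continues */
    }
    return 1; /* 1: the run is over */
}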
Yacc also has some predefined variables. These variables include the following:
- yytext: It holds the input token currently recognized by the lexical scanner.
- yyleng: It is a variable that holds the length of the matched input text.
- yyout: It determines the output stream for the output macro that processes input that does not match any rules set.
- yyin: It determines the input stream for the input functions and yylex().
- yylineno: It specifies the current input line number, which is kept up to date by the input and yycomment routines.
yyin and yyout can be changed. This can be done by assignment.
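A minimal sketch (the file names are hypothetical) of redirecting both streams by assignment before calling yylex():
yyin = fopen("input.txt", "r"); /* tokens are now read from input.txt */
yyout = fopen("output.txt", "w"); /* ECHO output now goes to output.txt */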
A Lex analyzer that converts instances of certain characters into a particular pattern
Let us create a lex analyzer that changes 'abc' occurrences in the input to 'ABC'. Create a file named strings.l. In it, add the following in the definitions section:
%{
/* Find instances of 'abc' to 'ABC' */
#include<stdio.h>
#include<string.h>
int i;
%}
%%
As for comments in the rules section of a Lex file, make sure they are indented away from the start of the line to avoid errors.
Add rules that check the words entered (in either upper or lowercase) for the characters 'abc' following each other. If found, only those specified characters are replaced.
[a-zA-Z]* {
for (i = 0; i + 2 < yyleng; i++) /* stop early so yytext[i + 2] stays within the match */
{
if ((yytext[i] == 'a') && (yytext[i + 1] == 'b') && (yytext[i + 2] == 'c'))
{
yytext[i] = 'A';
yytext[i + 1] = 'B';
yytext[i + 2] = 'C';
}
}
printf("%s", yytext);
}
. {
ECHO; /* echo any other single character unchanged */
}
\n {
printf("%s", yytext);
}
%%
As for the user subroutines section, copy-paste the following:
int main() {
yylex();
}
int yywrap() {
return 1;
}
ECHO is a macro that is used as an action to copy the matched input token held in the yytext variable to the lex output stream found in yyout. It is a macro just like yygetc() and BEGIN, among others. Run the following command in the terminal:
lex strings.l
Then, run the following:
cc lex.yy.c
flex and gcc can be replaced with lex and cc, as seen above.
Run the following command to run the executable:
./a.out
It now allows input from the terminal. Key in the following to see if it works:
We can see abc turn to ABC in javaabcfilewithoutknowingABCDabcd
The output is as follows:
A lex analyzer that identifies certain words
This lex analyzer will identify words among a pre-defined set. It then outputs whether each word is in the set or not.
In this case, it shall be checking for verbs. Create a new file named verbs.l. Begin with the definitions as follows:
%{
//program to classify a word as a verb or not a verb
%}
%%
In the rules and actions section, paste the following:
[ \t]+ ; /* Ignore whitespaces and tabs */
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go { printf("%s: is a verb\n", yytext);}
[a-zA-Z]+ {printf("%s: is not a verb\n",yytext);}
.|\n {ECHO; /* normal default anyway */}
%%
This code contains a series of verbs that act as the defined patterns. If the analyzer finds any of them, it reports that the word is a verb; otherwise, it reports that it is not. It does this for each word passed to it, e.g., in a sentence.
In the User subroutine section, create the main application as shown below:
int yywrap(){ return 1; }
int main(){
yylex();
return 0;
}
Run it by executing the following in the terminal:
lex verbs.l
Convert the C source code created into an executable analyzer using the command below:
cc lex.yy.c
Run the executable as follows:
./a.out
Key in some characters on the terminal and check out the results. For example, key in the sentence below:
Kelly is going home today
The results look as shown below:
You can find the full code here.
Conclusion
Creating high-level language compilers was one of the most significant steps in the development of computer software. High-level language compilers attract more developers to a language due to their ease of understanding.
Lexical analysis is a crucially important phase in the compilation process. It identifies tokens and passes them to syntax analysis before the compiler proceeds to the other steps.
Knowing how these analyzers are created is essential for understanding how compilers work.
Peer Review Contributions by: Srishilesh P S