LinuxCommandLibrary

lex

Generate lexical analyzer (scanner) source code

TLDR

Generate an analyzer from a Lex file, saving it to the file lex.yy.c

$ lex [analyzer.l]

Specify the output file
$ lex -t [analyzer.l] > [analyzer.c]

Compile a C file generated by Lex
$ c99 [path/to/lex.yy.c] -o [executable]

SYNOPSIS

lex [options] [filename...]
Commonly:
lex filename.l

PARAMETERS

-t
    Write the generated C code to standard output instead of lex.yy.c.

-v
    Print a summary of statistics about the generated scanner (e.g., number of states, rules).

-n
    Suppress the printing of the statistics summary.

-o outfile
    Specify the name of the output C file instead of the default lex.yy.c (a flex extension; not part of POSIX lex).

-F
    Use the fast scanner table representation, which trades larger tables for speed (flex option).

-V
    Display the version information for the `lex` implementation (typically `flex`).

DESCRIPTION

The lex command generates lexical analyzers (scanners or tokenizers) from an input specification. It reads a set of regular expression rules with corresponding actions written in C, and produces a C source file, conventionally named lex.yy.c, that defines the `yylex()` function. This function can be compiled and linked with a parser (often generated by yacc or bison) to form the front end of a compiler or interpreter. The generated scanner reads an input stream and partitions it into a sequence of tokens according to the rules in the lex specification file. On most modern Linux systems, the lex command is an implementation of flex (the Fast Lexical Analyzer Generator), which offers better performance and additional features while remaining compatible with traditional lex.
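
For illustration, a minimal self-contained specification (hypothetical file name count.l) that counts lines and words might look like the following. It assumes a POSIX lex or flex and supplies its own main() and yywrap(), so no lex library is needed at link time:

%{
/* count.l (hypothetical): count lines and words on standard input.
   Assumes POSIX lex or flex; main() and yywrap() are defined below,
   so linking against -ll/-lfl is not required. */
#include <stdio.h>
int lines = 0, words = 0;
%}
%%
\n              { lines++; }
[^ \t\n]+       { words++; }
[ \t]+          { /* skip other whitespace */ }
%%
int yywrap(void) { return 1; }   /* report end of input */

int main(void) {
    yylex();                     /* run the generated scanner */
    printf("lines: %d  words: %d\n", lines, words);
    return 0;
}

Generating and compiling it follows the TLDR commands above: lex count.l produces lex.yy.c, and c99 lex.yy.c -o count builds the scanner.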

CAVEATS

The output file name defaults to lex.yy.c unless the `-o` option (a flex extension) is used.
Different implementations of lex (e.g., the original AT&T `lex` and `flex`) can differ subtly in features and behavior.
Errors in the specification file may only surface as verbose compiler errors in the generated C code, which can be challenging to debug.

INPUT FILE FORMAT

A lex specification file (conventionally ending in .l) typically consists of three sections, separated by `%%` delimiters (a skeleton example follows this list):
1. Definitions Section: Contains C declarations and lex definitions (e.g., regular expression aliases).
2. Rules Section: Contains regular expression patterns and their corresponding C actions. This is the core of the specification.
3. User Subroutines Section: Contains additional C code, such as `main()` or other helper functions, which are copied verbatim to the output file.
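
A skeleton specification (hypothetical file name tokens.l) showing all three sections might look like this; the DIGIT alias and the printed labels are purely illustrative:

%{
/* 1. Definitions section: C declarations copied to the top of lex.yy.c. */
#include <stdio.h>
%}
DIGIT   [0-9]

%%
    /* 2. Rules section: patterns start in column 0, C actions follow. */
{DIGIT}+                 { printf("number: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*   { printf("identifier: %s\n", yytext); }
[ \t\n]+                 { /* skip whitespace */ }
.                        { printf("other: %s\n", yytext); }
%%
/* 3. User subroutines section: copied verbatim to the end of lex.yy.c. */
int yywrap(void) { return 1; }

int main(void) {
    return yylex();
}

Everything outside the rules section is copied into the output unchanged, which is why the %{ %} block and the code after the second %% are plain C.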

INTEGRATION WITH YACC/BISON

lex is commonly used as the first stage in building a compiler or interpreter. The `yylex()` function generated by lex is called by a parser (often generated by yacc or bison) to retrieve the next token from the input stream. This modular approach separates lexical analysis from syntactic analysis, making the compiler development process more manageable.
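
As a sketch of the scanner half of such a pipeline, the rules below return token codes to the parser. It assumes the grammar was processed with yacc -d (or bison -d), producing a y.tab.h header that defines NUMBER and PLUS and declares an int-valued yylval; all of these names are illustrative:

%{
/* Scanner side of a hypothetical calculator: feeds tokens to yyparse().
   Assumes y.tab.h (from `yacc -d`) defines NUMBER and PLUS and declares
   yylval; main() and yyparse() live on the parser side. */
#include <stdlib.h>
#include "y.tab.h"
%}
%%
[0-9]+      { yylval = atoi(yytext); return NUMBER; }
"+"         { return PLUS; }
[ \t\n]+    { /* skip whitespace */ }
.           { return yytext[0]; /* pass other characters through */ }
%%
int yywrap(void) { return 1; }

A typical build then runs yacc -d on the grammar and lex on this file, and links y.tab.c and lex.yy.c into one executable.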

HISTORY

The original lex lexical analyzer generator was developed by Mike Lesk and Eric Schmidt at Bell Labs in the mid-1970s. It became an integral part of the Unix toolchain, often used in conjunction with yacc (Yet Another Compiler-Compiler). Over time, various reimplementations and enhancements emerged, with flex (the Fast Lexical Analyzer Generator), developed by Vern Paxson, becoming the de facto standard `lex` implementation on most modern Unix-like systems, including Linux, thanks to its improved performance and additional features.

SEE ALSO

yacc(1), bison(1), flex(1)
