FLEX(1)                   USER COMMANDS                   FLEX(1)

NAME
     flex - fast lexical analyzer generator

SYNOPSIS
     flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix
     -Sskeleton] [--help --version] [filename ...]

OVERVIEW
     This manual describes flex, a tool for generating programs
     that perform pattern-matching on text.  The manual includes
     both tutorial and reference sections:

         Description
             a brief overview of the tool

         Some Simple Examples

         Format Of The Input File

         Patterns
             the extended regular expressions used by flex

         How The Input Is Matched
             the rules for determining what has been matched

         Actions
             how to specify what to do when a pattern is matched

         The Generated Scanner
             details regarding the scanner that flex produces;
             how to control the input source

         Start Conditions
             introducing context into your scanners, and
             managing "mini-scanners"

         Multiple Input Buffers
             how to manipulate multiple input sources; how to
             scan from strings instead of files

         End-of-file Rules
             special rules for matching the end of the input

         Miscellaneous Macros
             a summary of macros available to the actions

         Values Available To The User
             a summary of values available to the actions

         Interfacing With Yacc
             connecting flex scanners together with yacc parsers

         Options
             flex command-line options, and the "%option"
             directive

         Performance Considerations
             how to make your scanner go as fast as possible

         Generating C++ Scanners
             the (experimental) facility for generating C++
             scanner classes

         Incompatibilities With Lex And POSIX
             how flex differs from AT&T lex and the POSIX lex
             standard

         Diagnostics
             those error messages produced by flex (or scanners
             it generates) whose meanings might not be apparent

         Files
             files used by flex

         Deficiencies / Bugs
             known problems with flex

         See Also
             other documentation, related tools

         Author
             includes contact information

     Version 2.5          Last change: April 1995

DESCRIPTION
     flex is a tool for generating scanners: programs which
     recognize lexical patterns in text.  flex reads the given
     input files, or its standard input if no file names are
     given, for a description of a scanner to generate.  The
     description is in the form of pairs of regular expressions
     and C code, called rules.  flex generates as output a C
     source file, lex.yy.c, which defines a routine yylex().  This
     file is compiled and linked with the -lfl library to produce
     an executable.  When the executable is run, it analyzes its
     input for occurrences of the regular expressions.  Whenever
     it finds one, it executes the corresponding C code.

SOME SIMPLE EXAMPLES
     First some simple examples to get the flavor of how one uses
     flex.  The following flex input specifies a scanner which
     whenever it encounters the string "username" will replace it
     with the user's login name:

         %%
         username    printf( "%s", getlogin() );

     By default, any text not matched by a flex scanner is copied
     to the output, so the net effect of this scanner is to copy
     its input file to its output with each occurrence of
     "username" expanded.  In this input, there is just one rule.
     "username" is the pattern and the "printf" is the action.
     The "%%" marks the beginning of the rules.

     Here's another simple example:

                 int num_lines = 0, num_chars = 0;

         %%
         \n      ++num_lines; ++num_chars;
         .       ++num_chars;

         %%
         int main()
                 {
                 yylex();
                 printf( "# of lines = %d, # of chars = %d\n",
                         num_lines, num_chars );
                 return 0;
                 }

     This scanner counts the number of characters and the number
     of lines in its input (it produces no output other than the
     final report on the counts).  The first line declares two
     globals, "num_lines" and "num_chars", which are accessible
     both inside yylex() and in the main() routine declared after
     the second "%%".  There are two rules, one which matches a
     newline ("\n") and increments both the line count and the
     character count, and one which matches any character other
     than a newline (indicated by the "." regular expression).

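     For comparison, the effect of the two counting rules can be
     sketched in plain C without flex.  This is only an
     illustration: the struct counts type and the count_text()
     function are hypothetical names, not something flex
     generates, and the sketch reads from a C string rather than
     from yyin.

```c
/* Hand-written equivalent of the two counting rules.
 * "struct counts" and count_text() are illustrative names only;
 * they are not part of flex or of the scanner it generates. */
struct counts {
    int num_lines;
    int num_chars;
};

struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n') {
            /* rule "\n": a newline counts as a line and a character */
            ++c.num_lines;
            ++c.num_chars;
        } else {
            /* rule ".": any character other than a newline */
            ++c.num_chars;
        }
    }
    return c;
}
```

     Calling count_text( "ab\ncd\n" ) yields two lines and six
     characters, the same totals the flex scanner would report
     for that input.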
     A somewhat more complicated example:

         /* scanner for a toy Pascal-like language */

         %{
         /* need this for the calls to atoi() and atof() below */
         #include <stdlib.h>
         %}

         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*

         %%

         {DIGIT}+    {
                     printf( "An integer: %s (%d)\n", yytext,
                             atoi( yytext ) );
                     }

         {DIGIT}+"."{DIGIT}*        {
                     printf( "A float: %s (%g)\n", yytext,
                             atof( yytext ) );
                     }

         if|then|begin|end|procedure|function        {
                     printf( "A keyword: %s\n", yytext );
                     }

         {ID}        printf( "An identifier: %s\n", yytext );

         "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

         "{"[^}\n]*"}"     /* eat up one-line comments */

         [ \t\n]+          /* eat up whitespace */

         .           printf( "Unrecognized character: %s\n", yytext );

         %%

         int main( int argc, char **argv )
             {
             ++argv, --argc;  /* skip over program name */
             if ( argc > 0 )
                     yyin = fopen( argv[0], "r" );
             else
                     yyin = stdin;

             yylex();
             return 0;
             }

     This is the beginnings of a simple scanner for a language
     like Pascal.  It identifies different types of tokens and
     reports on what it has seen.

     The details of this example will be explained in the
     following sections.

FORMAT OF THE INPUT FILE
     The flex input file consists of three sections, separated by
     a line with just %% in it:

         definitions
         %%
         rules
         %%
         user code

     The definitions section contains declarations of simple name
     definitions to simplify the scanner specification, and
     declarations of start conditions, which are explained in a
     later section.

     Name definitions have the form:

         name definition

     The "name" is a word beginning with a letter or an
     underscore ('_') followed by zero or more letters, digits,
     '_', or '-' (dash).  The definition is taken to begin at the
     first non-white-space character following the name and to
     continue to the end of the line.  The definition can
     subsequently be referred to using "{name}", which will
     expand to "(definition)".  For example,

         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*

     defines "DIGIT" to be a regular expression which matches a
     single digit, and "ID" to be a regular expression which
     matches a letter followed by zero-or-more letters-or-digits.
     A subsequent reference to

         {DIGIT}+"."{DIGIT}*

     is identical to

         ([0-9])+"."([0-9])*

     and matches one-or-more digits followed by a '.' followed by
     zero-or-more digits.

     The rules section of the flex input contains a series of
     rules of the form:

         pattern   action

     where the pattern must be unindented and the action must
     begin on the same line.

     See below for a further description of patterns and actions.

     Finally, the user code section is simply copied to lex.yy.c
     verbatim.  It is used for companion routines which call or
     are called by the scanner.  The presence of this section is
     optional; if it is missing, the second %% in the input file
     may be skipped, too.

     In the definitions and rules sections, any indented text or
     text enclosed in %{ and %} is copied verbatim to the output
     (with the %{}'s removed).  The %{}'s must appear unindented
     on lines by themselves.

     In the rules section, any indented or %{} text appearing
     before the first rule may be used to declare variables which
     are local to the scanning routine and (after the
     declarations) code which is to be executed whenever the
     scanning routine is entered.  Other indented or %{} text in
     the rule section is still copied to the output, but its
     meaning is not well-defined and it may well cause
     compile-time errors (this feature is present for POSIX
     compliance; see below for other such features).

     In the definitions section (but not in the rules section),
     an unindented comment (i.e., a line beginning with "/*") is
     also copied verbatim to the output up to the next "*/".

PATTERNS
     The patterns in the input are written using an extended set
     of regular expressions.  These are:

         x          match the character 'x'
         .          any character (byte) except newline
         [xyz]      a "character class"; in this case, the pattern
                      matches either an 'x', a 'y', or a 'z'
         [abj-oZ]   a "character class" with a range in it; matches
                      an 'a', a 'b', any letter from 'j' through 'o',
                      or a 'Z'
         [^A-Z]     a "negated character class", i.e., any character
                      but those in the class.  In this case, any
                      character EXCEPT an uppercase letter.
         [^A-Z\n]   any character EXCEPT an uppercase letter or
                      a newline
         r*         zero or more r's, where r is any regular expression
         r+         one or more r's
         r?         zero or one r's (that is, "an optional r")
         r{2,5}     anywhere from two to five r's
         r{2,}      two or more r's
         r{4}       exactly 4 r's
         {name}     the expansion of the "name" definition
                      (see above)
         "[xyz]\"foo"
                    the literal string: [xyz]"foo
         \X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                      then the ANSI-C interpretation of \X.
                      Otherwise, a literal 'X' (used to escape
                      operators such as '*')
         \0         a NUL character (ASCII code 0)
         \123       the character with octal value 123
         \x2a       the character with hexadecimal value 2a
         (r)        match an r; parentheses are used to override
                      precedence (see below)


         rs         the regular expression r followed by the
                      regular expression s; called "concatenation"


         r|s        either an r or an s


         r/s        an r but only if it is followed by an s.  The
                      text matched by s is included when determining
                      whether this rule is the "longest match",
                      but is then returned to the input before
                      the action is executed.  So the action only
                      sees the text matched by r.  This type
                      of pattern is called "trailing context".
                      (There are some combinations of r/s that flex
                      cannot match correctly; see notes in the
                      Deficiencies / Bugs section below regarding
                      "dangerous trailing context".)
         ^r         an r, but only at the beginning of a line (i.e.,
                      when just starting to scan, or right after a
                      newline has been scanned).
         r$         an r, but only at the end of a line (i.e., just
                      before a newline).  Equivalent to "r/\n".

                    Note that flex's notion of "newline" is exactly
                    whatever the C compiler used to compile flex
                    interprets '\n' as; in particular, on some DOS
                    systems you must either filter out \r's in the
                    input yourself, or explicitly use r/\r\n for "r$".


         <s>r       an r, but only in start condition s (see
                      below for discussion of start conditions)
         <s1,s2,s3>r
                    same, but in any of start conditions s1,
                      s2, or s3
         <*>r       an r in any start condition, even an exclusive one.


         <<EOF>>    an end-of-file
         <s1,s2><<EOF>>
                    an end-of-file when in start condition s1 or s2

     Note that inside of a character class, all regular
     expression operators lose their special meaning except
     escape ('\') and the character class operators, '-', ']',
     and, at the beginning of the class, '^'.

     The regular expressions listed above are grouped according
     to precedence, from highest precedence at the top to lowest
     at the bottom.  Those grouped together have equal
     precedence.  For example,

         foo|bar*

     is the same as

         (foo)|(ba(r*))

     since the '*' operator has higher precedence than
     concatenation, and concatenation higher than alternation
     ('|').  This pattern therefore matches either the string
     "foo" or the string "ba" followed by zero-or-more r's.  To
     match "foo" or zero-or-more "bar"'s, use:

         foo|(bar)*

     and to match zero-or-more "foo"'s-or-"bar"'s:

         (foo|bar)*

     In addition to characters and ranges of characters,
     character classes can also contain character class
     expressions.  These are expressions enclosed inside [: and
     :] delimiters (which themselves must appear between the '['
     and ']' of the character class; other elements may occur
     inside the character class, too).  The valid expressions
     are:

         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

     These expressions all designate a set of characters
     equivalent to the corresponding standard C isXXX function.
     For example, [:alnum:] designates those characters for which
     isalnum() returns true - i.e., any alphabetic or numeric.
     Some systems don't provide isblank(), so flex defines
     [:blank:] as a blank or a tab.

     For example, the following character classes are all
     equivalent:

         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:]0-9]
         [a-zA-Z0-9]

     If your scanner is case-insensitive (the -i flag), then
     [:upper:] and [:lower:] are equivalent to [:alpha:].

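     The correspondence between character class expressions and
     the isXXX functions can be sketched in C.  The helper names
     below are hypothetical and are not flex internals; the sketch
     only restates the membership tests described above.  The
     equivalence with [a-zA-Z0-9] additionally assumes the
     default "C" locale.

```c
#include <ctype.h>
#include <limits.h>

/* Illustrative membership tests; the function names are assumptions
 * made for this sketch, not part of flex. */

/* [[:alnum:]] */
int in_alnum(unsigned char c)
{
    return isalnum(c) != 0;
}

/* [[:alpha:][:digit:]] */
int in_alpha_or_digit(unsigned char c)
{
    return isalpha(c) != 0 || isdigit(c) != 0;
}

/* [[:blank:]] as the manual defines it: a blank or a tab */
int in_blank(unsigned char c)
{
    return c == ' ' || c == '\t';
}

/* Returns 1 if the first two classes accept exactly the same bytes. */
int alnum_forms_agree(void)
{
    for (int c = 0; c <= UCHAR_MAX; ++c)
        if (in_alnum((unsigned char)c) != in_alpha_or_digit((unsigned char)c))
            return 0;
    return 1;
}
```

     alnum_forms_agree() checks every byte value, which is a
     quick way to convince yourself that two spellings of a
     class really are interchangeable before relying on it.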
     Some notes on patterns:

     -    A negated character class such as the example "[^A-Z]"
          above will match a newline unless "\n" (or an
          equivalent escape sequence) is one of the characters
          explicitly present in the negated character class
          (e.g., "[^A-Z\n]").  This is unlike how many other
          regular expression tools treat negated character
          classes, but unfortunately the inconsistency is
          historically entrenched.  Matching newlines means that
          a pattern like [^"]* can match the entire input unless
          there's another quote in the input.

     -    A rule can have at most one instance of trailing
          context (the '/' operator or the '$' operator).  The
          start condition, '^', and "<<EOF>>" patterns can only
          occur at the beginning of a pattern, and, like '/' and
          '$', cannot be grouped inside parentheses.  A '^' which
          does not occur at the beginning of a rule or a '$'
          which does not occur at the end of a rule loses its
          special properties and is treated as a normal
          character.

          The following are illegal:

              foo/bar$
              <sc1>foo<sc2>bar

          Note that the first of these can be written
          "foo/bar\n".

          The following will result in '$' or '^' being treated
          as a normal character:

              foo|(bar$)
              foo|^bar

          If what's wanted is a "foo" or a
          bar-followed-by-a-newline, the following could be used
          (the special '|' action is explained below):

              foo      |
              bar$     /* action goes here */

          A similar trick will work for matching a foo or a
          bar-at-the-beginning-of-a-line.

HOW THE INPUT IS MATCHED
     When the generated scanner is run, it analyzes its input
     looking for strings which match any of its patterns.  If it
     finds more than one match, it takes the one matching the
     most text (for trailing context rules, this includes the
     length of the trailing part, even though it will then be
     returned to the input).  If it finds two or more matches of
     the same length, the rule listed first in the flex input
     file is chosen.

     Once the match is determined, the text corresponding to the
     match (called the token) is made available in the global
     character pointer yytext, and its length in the global
     integer yyleng.  The action corresponding to the matched
     pattern is then executed (a more detailed description of
     actions follows), and then the remaining input is scanned
     for another match.

     If no match is found, then the default rule is executed: the
     next character in the input is considered matched and copied
     to the standard output.  Thus, the simplest legal flex input
     is:

         %%

     which generates a scanner that simply copies its input (one
     character at a time) to its output.

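     The selection rule at the start of this section (longest
     match wins; on a tie, the rule listed first wins) can be
     sketched in C.  This is not how flex works internally, since
     flex compiles all patterns into a single table-driven
     automaton; the matcher functions and pick_rule() below are
     hypothetical stand-ins, and the letter ranges assume an
     ASCII character set.

```c
#include <stddef.h>
#include <string.h>

/* Each "rule" reports how many leading characters of the input it
 * matches (0 means no match).  All names here are illustrative. */
typedef size_t (*matcher_fn)(const char *input);

static size_t match_if(const char *s)          /* pattern: if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)          /* pattern: [a-z][a-z0-9]* */
{
    size_t n = 0;

    if (s[0] < 'a' || s[0] > 'z')
        return 0;
    while ((s[n] >= 'a' && s[n] <= 'z') || (s[n] >= '0' && s[n] <= '9'))
        ++n;
    return n;
}

/* Rules in the order they would appear in the flex input file. */
static const matcher_fn rules[] = { match_if, match_id };

/* Returns the index of the winning rule, or -1 if no rule matches
 * (the point at which a real scanner applies the default rule). */
int pick_rule(const char *input, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t len = rules[i](input);
        /* strict '>' keeps the earlier rule when lengths are equal */
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    if (match_len != NULL)
        *match_len = best_len;
    return best;
}
```

     On "if x" both rules match two characters, so the keyword
     rule (listed first) is chosen; on "iffy" the identifier rule
     matches four characters and wins on length.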
     Note that yytext can be defined in two different ways:
     either as a character pointer or as a character array.  You
     can control which definition flex uses by including one of
     the special directives %pointer or %array in the first
     (definitions) section of your flex input.  The default is
     %pointer, unless you use the -l lex compatibility option, in
     which case yytext will be an array.  The advantage of using
     %pointer is substantially faster scanning and no buffer
     overflow when matching very large tokens (unless you run out
     of dynamic memory).  The disadvantage is that you are
     restricted in how your actions can modify yytext (see the
     next section), and calls to the unput() function destroy the
     present contents of yytext, which can be a considerable
     porting headache when moving between different lex versions.

     The advantage of %array is that you can then modify yytext
     to your heart's content, and calls to unput() do not destroy
     yytext (see below).  Furthermore, existing lex programs
     sometimes access yytext externally using declarations of the
     form:

         extern char yytext[];

     This definition is erroneous when used with %pointer, but
     correct for %array.

     %array defines yytext to be an array of YYLMAX characters,
     which defaults to a fairly large value.  You can change the
     size by simply #define'ing YYLMAX to a different value in
     the first section of your flex input.  As mentioned above,
     with %pointer yytext grows dynamically to accommodate large
     tokens.  While this means your %pointer scanner can
     accommodate very large tokens (such as matching entire
     blocks of comments), bear in mind that each time the scanner
     must resize yytext it also must rescan the entire token from
     the beginning, so matching such tokens can prove slow.
     yytext presently does not dynamically grow if a call to
     unput() results in too much text being pushed back; instead,
     a run-time error results.

     Also note that you cannot use %array with C++ scanner
     classes (the c++ option; see below).

ACTIONS
     Each pattern in a rule has a corresponding action, which can
     be any arbitrary C statement.  The pattern ends at the first
     non-escaped whitespace character; the remainder of the line
     is its action.  If the action is empty, then when the
     pattern is matched the input token is simply discarded.  For
     example, here is the specification for a program which
     deletes all occurrences of "zap me" from its input:

         %%
         "zap me"

     (It will copy all other characters in the input to the
     output since they will be matched by the default rule.)

     Here is a program which compresses multiple blanks and tabs
     down to a single blank, and throws away whitespace found at
     the end of a line:

         %%
         [ \t]+        putchar( ' ' );
         [ \t]+$       /* ignore this token */

     If the action contains a '{', then the action spans till the
     balancing '}' is found, and the action may cross multiple
     lines.  flex knows about C strings and comments and won't be
     fooled by braces found within them, but also allows actions
     to begin with %{ and will consider the action to be all the
     text up to the next %} (regardless of ordinary braces inside
     the action).

     An action consisting solely of a vertical bar ('|') means
     "same as the action for the next rule."  See below for an
     illustration.

     Actions can include arbitrary C code, including return
     statements to return a value to whatever routine called
     yylex().  Each time yylex() is called it continues
     processing tokens from where it last left off until it
     either reaches the end of the file or executes a return.

     Actions are free to modify yytext except for lengthening it
     (adding characters to its end--these will overwrite later
     characters in the input stream).  This however does not
     apply when using %array (see above); in that case, yytext
     may be freely modified in any way.

     Actions are free to modify yyleng except they should not do
     so if the action also includes use of yymore() (see below).

     There are a number of special directives which can be
     included within an action:

     -    ECHO copies yytext to the scanner's output.

     -    BEGIN followed by the name of a start condition places
          the scanner in the corresponding start condition (see
          below).

     -    REJECT directs the scanner to proceed on to the "second
          best" rule which matched the input (or a prefix of the
          input).  The rule is chosen as described above in "How
          the Input is Matched", and yytext and yyleng set up
          appropriately.  It may either be one which matched as
          much text as the originally chosen rule but came later
          in the flex input file, or one which matched less text.
          For example, the following will both count the words in
          the input and call the routine special() whenever
          "frob" is seen:

                      int word_count = 0;
              %%

              frob        special(); REJECT;
              [^ \t\n]+   ++word_count;

          Without the REJECT, any "frob"'s in the input would not
          be counted as words, since the scanner normally
          executes only one action per token.  Multiple REJECT's
          are allowed, each one finding the next best choice to
          the currently active rule.  For example, when the
          following scanner scans the token "abcd", it will write
          "abcdabcaba" to the output:

              %%
              a        |
              ab       |
              abc      |
              abcd     ECHO; REJECT;
              .|\n     /* eat up any unmatched character */

          (The first three rules share the fourth's action since
          they use the special '|' action.)  REJECT is a
          particularly expensive feature in terms of scanner
          performance; if it is used in any of the scanner's
          actions it will slow down all of the scanner's
          matching.  Furthermore, REJECT cannot be used with the
          -Cf or -CF options (see below).

          Note also that unlike the other special actions, REJECT
          is a branch; code immediately following it in the
          action will not be executed.

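     The "abcdabcaba" result can be reproduced with a small C
     simulation of the REJECT example.  This is a sketch under
     stated assumptions: scan_with_reject() is a hypothetical
     function that hard-codes the four literal rules and the
     catch-all rule, and it only mimics the order in which
     matches are tried; a real scanner does this with its
     matching tables.

```c
#include <stddef.h>
#include <string.h>

/* Simulates the five-rule REJECT example on a string and returns
 * what the ECHO actions would have written.  The function name and
 * the fixed-size static buffer are assumptions of this sketch. */
const char *scan_with_reject(const char *input)
{
    static const char *const pats[] = { "a", "ab", "abc", "abcd" };
    static char out[256];
    size_t used = 0;

    out[0] = '\0';
    for (size_t pos = 0; input[pos] != '\0'; ++pos) {
        /* Candidates are tried longest first; each one does
         * "ECHO; REJECT;", which hands over to the next best. */
        for (int i = 3; i >= 0; --i) {
            size_t len = strlen(pats[i]);
            if (strncmp(input + pos, pats[i], len) == 0 &&
                used + len < sizeof out) {
                memcpy(out + used, pats[i], len);   /* ECHO */
                used += len;
                out[used] = '\0';
            }
        }
        /* The final REJECT lands on ".|\n", which silently eats
         * one character; the loop then advances by one. */
    }
    return out;
}
```

     For the input "abcd" the simulation echoes "abcd", "abc",
     "ab", and "a" in turn before the catch-all rule consumes the
     'a'; the remaining characters match only the catch-all rule.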
     -    yymore() tells the scanner that the next time it
          matches a rule, the corresponding token should be
          appended onto the current value of yytext rather than
          replacing it.  For example, given the input
          "mega-kludge" the following will write
          "mega-mega-kludge" to the output:

              %%
              mega-    ECHO; yymore();
              kludge   ECHO;

          First "mega-" is matched and echoed to the output.
          Then "kludge" is matched, but the previous "mega-" is
          still hanging around at the beginning of yytext so the
          ECHO for the "kludge" rule will actually write
          "mega-kludge".

     Two notes regarding use of yymore().  First, yymore()
     depends on the value of yyleng correctly reflecting the size
     of the current token, so you must not modify yyleng if you
     are using yymore().  Second, the presence of yymore() in the
     scanner's action entails a minor performance penalty in the
     scanner's matching speed.

     -    yyless(n) returns all but the first n characters of the
          current token back to the input stream, where they will
          be rescanned when the scanner looks for the next match.
          yytext and yyleng are adjusted appropriately (e.g.,
          yyleng will now be equal to n).  For example, on the
          input "foobar" the following will write out
          "foobarbar":

              %%
              foobar    ECHO; yyless(3);
              [a-z]+    ECHO;

          An argument of 0 to yyless will cause the entire
          current input string to be scanned again.  Unless
          you've changed how the scanner will subsequently
          process its input (using BEGIN, for example), this will
          result in an endless loop.

     Note that yyless is a macro and can only be used in the flex
     input file, not from other source files.

     -    unput(c) puts the character c back onto the input
          stream.  It will be the next character scanned.  The
          following action will take the current token and cause
          it to be rescanned enclosed in parentheses.

              {
              int i;
              /* Copy yytext because unput() trashes yytext */
              char *yycopy = strdup( yytext );
              unput( ')' );
              for ( i = yyleng - 1; i >= 0; --i )
                  unput( yycopy[i] );
              unput( '(' );
              free( yycopy );
              }

          Note that since each unput() puts the given character
          back at the beginning of the input stream, pushing back
          strings must be done back-to-front.

     An important potential problem when using unput() is that if
     you are using %pointer (the default), a call to unput()
     destroys the contents of yytext, starting with its rightmost
     character and devouring one character to the left with each
     call.  If you need the value of yytext preserved after a
     call to unput() (as in the above example), you must either
     first copy it elsewhere, or build your scanner using %array
     instead (see How The Input Is Matched).

     Finally, note that you cannot put back EOF to attempt to
     mark the input stream with an end-of-file.

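     Why pushed-back strings must go in back-to-front can be seen
     with a minimal last-in, first-out model of the input stream.
     The sketch below does not use flex's own buffer: push_char(),
     next_char(), and push_parenthesized() are hypothetical
     stand-ins for unput(), input(), and the action shown above.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* A tiny LIFO pushback store standing in for the scanner's input.
 * Every name in this sketch is illustrative, not a flex API. */
static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Like unput(c): c becomes the next character read.
 * Returns 0 on success, -1 if the store is full. */
int push_char(char c)
{
    if (pushback_len == PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = c;
    return 0;
}

/* Like input(): returns the next character, or -1 when empty. */
int next_char(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char)pushback[--pushback_len];
}

/* Pushes "(" token ")" so that it reads back in forward order:
 * the closing parenthesis first, then the token from its last
 * character to its first, then the opening parenthesis. */
int push_parenthesized(const char *token)
{
    size_t n = strlen(token);

    if (push_char(')') != 0)
        return -1;
    while (n > 0)
        if (push_char(token[--n]) != 0)
            return -1;
    return push_char('(');
}
```

     After push_parenthesized( "ab" ), successive calls to
     next_char() return '(', 'a', 'b', ')' in that order, because
     the character pushed last is the one read first.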
     -    input() reads the next character from the input stream.
          For example, the following is one way to eat up C
          comments:

              %%
              "/*"        {
                          register int c;

                          for ( ; ; )
                              {
                              while ( (c = input()) != '*' &&
                                      c != EOF )
                                  ;    /* eat up text of comment */

                              if ( c == '*' )
                                  {
                                  while ( (c = input()) == '*' )
                                      ;
                                  if ( c == '/' )
                                      break;    /* found the end */
                                  }

                              if ( c == EOF )
                                  {
                                  error( "EOF in comment" );
                                  break;
                                  }
                              }
                          }

          (Note that if the scanner is compiled using C++, then
          input() is instead referred to as yyinput(), in order
          to avoid a name clash with the C++ stream by the name
          of input.)

     -    YY_FLUSH_BUFFER flushes the scanner's internal buffer
          so that the next time the scanner attempts to match a
          token, it will first refill the buffer using YY_INPUT
          (see The Generated Scanner, below).  This action is a
          special case of the more general yy_flush_buffer()
          function, described below in the section Multiple Input
          Buffers.

     -    yyterminate() can be used in lieu of a return statement
          in an action.  It terminates the scanner and returns a
          0 to the scanner's caller, indicating "all done".  By
          default, yyterminate() is also called when an
          end-of-file is encountered.  It is a macro and may be
          redefined.

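     The comment-eating loop shown under input() above can be
     tried outside a scanner by substituting a string-backed
     reader for input().  This is a sketch only: next_input() and
     skip_comment() are hypothetical names, and the error() call
     of the original is replaced by a -1 return value.

```c
#include <stdio.h>   /* for EOF */

/* Hypothetical string-backed replacement for the scanner's input(). */
static const char *src = "";

static int next_input(void)
{
    return *src != '\0' ? (unsigned char)*src++ : EOF;
}

/* Call with the text that follows an opening comment delimiter.
 * Returns the number of characters consumed up to and including
 * the closing delimiter, or -1 if the text ends first. */
long skip_comment(const char *text)
{
    int c;

    src = text;
    for ( ; ; ) {
        while ((c = next_input()) != '*' && c != EOF)
            ;   /* eat up text of comment */

        if (c == '*') {
            while ((c = next_input()) == '*')
                ;   /* a run of stars may precede the slash */
            if (c == '/')
                break;   /* found the end */
        }

        if (c == EOF)
            return -1;   /* EOF in comment */
    }
    return (long)(src - text);
}
```

     The inner loop over repeated '*' characters is what lets the
     routine recognize an ending such as "**/", where the star
     that precedes the slash is not the first star seen.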
THE GENERATED SCANNER
|
||
|
The output of flex is the file lex.yy.c, which contains the
|
||
|
scanning routine yylex(), a number of tables used by it for
|
||
|
matching tokens, and a number of auxiliary routines and mac-
|
||
|
ros. By default, yylex() is declared as follows:
|
||
|
|
||
|
int yylex()
|
||
|
{
|
||
|
... various definitions and the actions in here ...
|
||
|
}
|
||
|
|
||
|
(If your environment supports function prototypes, then it
|
||
|
will be "int yylex( void )".) This definition may be
|
||
|
changed by defining the "YY_DECL" macro. For example, you
|
||
|
could use:
|
||
|
|
||
|
#define YY_DECL float lexscan( a, b ) float a, b;
|
||
|
|
||
|
to give the scanning routine the name lexscan, returning a
|
||
|
float, and taking two floats as arguments. Note that if you
|
||
|
give arguments to the scanning routine using a K&R-
|
||
|
style/non-prototyped function declaration, you must ter-
|
||
|
minate the definition with a semi-colon (;).
|
||
|
|
||
|
Whenever yylex() is called, it scans tokens from the global
|
||
|
input file yyin (which defaults to stdin). It continues
|
||
|
until it either reaches an end-of-file (at which point it
|
||
|
returns the value 0) or one of its actions executes a return
|
||
|
statement.
|
||
|
|
||
|
     If the scanner reaches an end-of-file, the behavior of subsequent
     calls is undefined unless either yyin is pointed at a new input
     file (in which case scanning continues from that file), or
     yyrestart() is called.  yyrestart() takes one argument, a FILE *
     pointer (which can be nil, if you've set up YY_INPUT to scan from
     a source other than yyin), and initializes yyin for scanning from
     that file.  Essentially there is no difference between just
     assigning yyin to a new input file and using yyrestart() to do
     so; the latter is available for compatibility with previous
     versions of flex, and because it can be used to switch input
     files in the middle of scanning.  It can also be used to throw
     away the current input buffer, by calling it with an argument of
     yyin; but it is better to use YY_FLUSH_BUFFER (see above).  Note
     that yyrestart() does not reset the start condition to INITIAL
     (see Start Conditions, below).
|
||
|
|
||
|
If yylex() stops scanning due to executing a return state-
|
||
|
ment in one of the actions, the scanner may then be called
|
||
|
again and it will resume scanning where it left off.
|
||
|
|
||
|
By default (and for purposes of efficiency), the scanner
|
||
|
uses block-reads rather than simple getc() calls to read
|
||
|
characters from yyin. The nature of how it gets its input
|
||
|
can be controlled by defining the YY_INPUT macro.
|
||
|
YY_INPUT's calling sequence is
|
||
|
"YY_INPUT(buf,result,max_size)". Its action is to place up
|
||
|
to max_size characters in the character array buf and return
|
||
|
in the integer variable result either the number of charac-
|
||
|
ters read or the constant YY_NULL (0 on Unix systems) to
|
||
|
indicate EOF. The default YY_INPUT reads from the global
|
||
|
file-pointer "yyin".
|
||
|
|
||
|
A sample definition of YY_INPUT (in the definitions section
|
||
|
of the input file):
|
||
|
|
||
|
%{
|
||
|
#define YY_INPUT(buf,result,max_size) \
|
||
|
{ \
|
||
|
int c = getchar(); \
|
||
|
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
|
||
|
}
|
||
|
%}
|
||
|
|
||
|
This definition will change the input processing to occur
|
||
|
one character at a time.
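     The contract that YY_INPUT must honor can be illustrated outside
     of flex with an ordinary C function.  The sketch below is not
     flex code: read_chunk() and struct mem_source are hypothetical
     names, and the function only mirrors the behavior described
     above (store at most max_size characters in buf, report how many
     were stored, and report 0 at end of input).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *data;   /* text still to be handed out */
    size_t      left;   /* number of characters remaining */
};

/*
 * Mirrors the YY_INPUT contract: copy up to max_size characters
 * into buf and return how many were copied, or 0 (the value of
 * YY_NULL on Unix systems) once the source is exhausted.
 */
int read_chunk(struct mem_source *src, char *buf, int max_size)
{
    size_t n;

    assert(src != NULL && buf != NULL && max_size > 0);

    n = src->left < (size_t) max_size ? src->left : (size_t) max_size;
    memcpy(buf, src->data, n);
    src->data += n;
    src->left -= n;
    return (int) n;
}
```

     A YY_INPUT definition built on such a function would assign its
     return value to result.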
|
||
|
|
||
|
When the scanner receives an end-of-file indication from
|
||
|
YY_INPUT, it then checks the yywrap() function. If yywrap()
|
||
|
returns false (zero), then it is assumed that the function
|
||
|
has gone ahead and set up yyin to point to another input
|
||
|
file, and scanning continues. If it returns true (non-
|
||
|
zero), then the scanner terminates, returning 0 to its
|
||
|
caller. Note that in either case, the start condition
|
||
|
remains unchanged; it does not revert to INITIAL.
|
||
|
|
||
|
If you do not supply your own version of yywrap(), then you
|
||
|
must either use %option noyywrap (in which case the scanner
|
||
|
behaves as though yywrap() returned 1), or you must link
|
||
|
with -lfl to obtain the default version of the routine,
|
||
|
which always returns 1.
|
||
|
|
||
|
Three routines are available for scanning from in-memory
|
||
|
buffers rather than files: yy_scan_string(),
|
||
|
yy_scan_bytes(), and yy_scan_buffer(). See the discussion of
|
||
|
them below in the section Multiple Input Buffers.
|
||
|
|
||
|
The scanner writes its ECHO output to the yyout global
|
||
|
(default, stdout), which may be redefined by the user simply
|
||
|
by assigning it to some other FILE pointer.
|
||
|
|
||
|
START CONDITIONS
|
||
|
flex provides a mechanism for conditionally activating
|
||
|
rules. Any rule whose pattern is prefixed with "<sc>" will
|
||
|
only be active when the scanner is in the start condition
|
||
|
named "sc". For example,
|
||
|
|
||
|
<STRING>[^"]* { /* eat up the string body ... */
|
||
|
...
|
||
|
}
|
||
|
|
||
|
will be active only when the scanner is in the "STRING"
|
||
|
start condition, and
|
||
|
|
||
|
<INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
|
||
|
...
|
||
|
}
|
||
|
|
||
|
will be active only when the current start condition is
|
||
|
either "INITIAL", "STRING", or "QUOTE".
|
||
|
|
||
|
Start conditions are declared in the definitions (first)
|
||
|
section of the input using unindented lines beginning with
|
||
|
either %s or %x followed by a list of names. The former
|
||
|
declares inclusive start conditions, the latter exclusive
|
||
|
start conditions. A start condition is activated using the
|
||
|
BEGIN action. Until the next BEGIN action is executed,
|
||
|
rules with the given start condition will be active and
|
||
|
rules with other start conditions will be inactive. If the
|
||
|
start condition is inclusive, then rules with no start con-
|
||
|
ditions at all will also be active. If it is exclusive,
|
||
|
then only rules qualified with the start condition will be
|
||
|
active. A set of rules contingent on the same exclusive
|
||
|
     start condition describes a scanner which is independent of
|
||
|
any of the other rules in the flex input. Because of this,
|
||
|
exclusive start conditions make it easy to specify "mini-
|
||
|
scanners" which scan portions of the input that are syntac-
|
||
|
tically different from the rest (e.g., comments).
|
||
|
|
||
|
If the distinction between inclusive and exclusive start
|
||
|
conditions is still a little vague, here's a simple example
|
||
|
illustrating the connection between the two. The set of
|
||
|
rules:
|
||
|
|
||
|
%s example
|
||
|
%%
|
||
|
|
||
|
<example>foo do_something();
|
||
|
|
||
|
bar something_else();
|
||
|
|
||
|
is equivalent to
|
||
|
|
||
|
%x example
|
||
|
%%
|
||
|
|
||
|
<example>foo do_something();
|
||
|
|
||
|
<INITIAL,example>bar something_else();
|
||
|
|
||
|
Without the <INITIAL,example> qualifier, the bar pattern in
|
||
|
the second example wouldn't be active (i.e., couldn't match)
|
||
|
when in start condition example. If we just used <example>
|
||
|
to qualify bar, though, then it would only be active in
|
||
|
example and not in INITIAL, while in the first example it's
|
||
|
active in both, because in the first example the example
|
||
|
     start condition is an inclusive (%s) start condition.
|
||
|
|
||
|
Also note that the special start-condition specifier <*>
|
||
|
matches every start condition. Thus, the above example
|
||
|
     could also have been written:
|
||
|
|
||
|
%x example
|
||
|
%%
|
||
|
|
||
|
<example>foo do_something();
|
||
|
|
||
|
<*>bar something_else();
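     The activation rules for inclusive, exclusive, and <*> start
     conditions can be summarized as a small predicate.  This is only
     a model of the behavior described in this section, not how flex
     implements it (flex compiles these decisions into its tables).
     struct rule, rule_is_active(), and the integer condition codes
     are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one rule's start-condition prefix. */
struct rule {
    int        match_all;  /* non-zero for a <*> prefix */
    const int *conds;      /* conditions listed in <...>, or NULL */
    size_t     nconds;     /* entries in conds; 0 means unqualified */
};

/*
 * Returns non-zero if the rule can match while the scanner is in
 * start condition `current`.  `exclusive` is non-zero if `current`
 * was declared with %x, and zero if it was declared with %s or is
 * INITIAL.
 */
int rule_is_active(const struct rule *r, int current, int exclusive)
{
    size_t i;

    assert(r != NULL);

    if (r->match_all)
        return 1;
    if (r->nconds == 0)            /* rule with no start condition */
        return !exclusive;
    for (i = 0; i < r->nconds; i++)
        if (r->conds[i] == current)
            return 1;
    return 0;
}
```

     With this model, the unqualified bar rule is active in an
     inclusive example condition and inactive in an exclusive one,
     which is the difference the examples above illustrate.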
|
||
|
|
||
|
|
||
|
The default rule (to ECHO any unmatched character) remains
|
||
|
active in start conditions. It is equivalent to:
|
||
|
|
||
|
<*>.|\n ECHO;
|
||
|
|
||
|
|
||
|
BEGIN(0) returns to the original state where only the rules
|
||
|
with no start conditions are active. This state can also be
|
||
|
referred to as the start-condition "INITIAL", so
|
||
|
BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses
|
||
|
around the start condition name are not required but are
|
||
|
considered good style.)
|
||
|
|
||
|
BEGIN actions can also be given as indented code at the
|
||
|
beginning of the rules section. For example, the following
|
||
|
will cause the scanner to enter the "SPECIAL" start condi-
|
||
|
tion whenever yylex() is called and the global variable
|
||
|
enter_special is true:
|
||
|
|
||
|
int enter_special;
|
||
|
|
||
|
%x SPECIAL
|
||
|
%%
|
||
|
if ( enter_special )
|
||
|
BEGIN(SPECIAL);
|
||
|
|
||
|
<SPECIAL>blahblahblah
|
||
|
...more rules follow...
|
||
|
|
||
|
|
||
|
To illustrate the uses of start conditions, here is a
|
||
|
scanner which provides two different interpretations of a
|
||
|
string like "123.456". By default it will treat it as three
|
||
|
tokens, the integer "123", a dot ('.'), and the integer
|
||
|
"456". But if the string is preceded earlier in the line by
|
||
|
the string "expect-floats" it will treat it as a single
|
||
|
token, the floating-point number 123.456:
|
||
|
|
||
|
%{
|
||
|
#include <math.h>
|
||
|
%}
|
||
|
%s expect
|
||
|
|
||
|
%%
|
||
|
expect-floats BEGIN(expect);
|
||
|
|
||
|
<expect>[0-9]+"."[0-9]+ {
|
||
|
printf( "found a float, = %f\n",
|
||
|
atof( yytext ) );
|
||
|
}
|
||
|
<expect>\n {
|
||
|
/* that's the end of the line, so
|
||
|
                      * we need another "expect-floats"
|
||
|
* before we'll recognize any more
|
||
|
* numbers
|
||
|
*/
|
||
|
BEGIN(INITIAL);
|
||
|
}
|
||
|
|
||
|
[0-9]+ {
|
||
|
printf( "found an integer, = %d\n",
|
||
|
atoi( yytext ) );
|
||
|
}
|
||
|
|
||
|
"." printf( "found a dot\n" );
|
||
|
|
||
|
Here is a scanner which recognizes (and discards) C comments
|
||
|
while maintaining a count of the current input line.
|
||
|
|
||
|
%x comment
|
||
|
%%
|
||
|
int line_num = 1;
|
||
|
|
||
|
"/*" BEGIN(comment);
|
||
|
|
||
|
<comment>[^*\n]* /* eat anything that's not a '*' */
|
||
|
<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
|
||
|
<comment>\n ++line_num;
|
||
|
<comment>"*"+"/" BEGIN(INITIAL);
|
||
|
|
||
|
This scanner goes to a bit of trouble to match as much text
|
||
|
as possible with each rule. In general, when attempting to
|
||
|
     write a high-speed scanner, try to match as much as possible in
|
||
|
each rule, as it's a big win.
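     For comparison, the same comment-skipping logic can be written
     by hand in C.  The sketch below is an illustration only, not
     code that flex generates; skip_comment() is a hypothetical name,
     and it assumes the whole input is available as a NUL-terminated
     string.  Its branches correspond to the <comment> rules above:
     count newlines, stop at a run of '*' followed by '/', and eat
     everything else.

```c
#include <assert.h>
#include <stddef.h>

/*
 * p points just past an opening "/" "*".  Returns a pointer just
 * past the closing "*" "/", or NULL if the input ends before the
 * comment is closed.  *line_num is incremented for each newline
 * seen inside the comment.
 */
const char *skip_comment(const char *p, int *line_num)
{
    assert(p != NULL && line_num != NULL);

    while (*p != '\0') {
        if (*p == '\n') {
            ++*line_num;
            ++p;
        } else if (*p == '*') {
            while (*p == '*')      /* eat the whole run of '*'s */
                ++p;
            if (*p == '/')
                return p + 1;      /* found the end */
        } else {
            ++p;                   /* anything that's not a '*' */
        }
    }
    return NULL;                   /* unterminated comment */
}
```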
|
||
|
|
||
|
     Note that start condition names are really integer values
|
||
|
and can be stored as such. Thus, the above could be
|
||
|
extended in the following fashion:
|
||
|
|
||
|
%x comment foo
|
||
|
%%
|
||
|
int line_num = 1;
|
||
|
int comment_caller;
|
||
|
|
||
|
"/*" {
|
||
|
comment_caller = INITIAL;
|
||
|
BEGIN(comment);
|
||
|
}
|
||
|
|
||
|
...
|
||
|
|
||
|
<foo>"/*" {
|
||
|
comment_caller = foo;
|
||
|
BEGIN(comment);
|
||
|
}
|
||
|
|
||
|
<comment>[^*\n]* /* eat anything that's not a '*' */
|
||
|
<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
|
||
|
<comment>\n ++line_num;
|
||
|
<comment>"*"+"/" BEGIN(comment_caller);
|
||
|
|
||
|
Furthermore, you can access the current start condition
|
||
|
using the integer-valued YY_START macro. For example, the
|
||
|
above assignments to comment_caller could instead be written
|
||
|
|
||
|
comment_caller = YY_START;
|
||
|
|
||
|
Flex provides YYSTATE as an alias for YY_START (since that
|
||
|
is what's used by AT&T lex).
|
||
|
|
||
|
Note that start conditions do not have their own name-space;
|
||
|
%s's and %x's declare names in the same fashion as
|
||
|
#define's.
|
||
|
|
||
|
Finally, here's an example of how to match C-style quoted
|
||
|
strings using exclusive start conditions, including expanded
|
||
|
escape sequences (but not including checking for a string
|
||
|
that's too long):
|
||
|
|
||
|
%x str
|
||
|
|
||
|
%%
|
||
|
char string_buf[MAX_STR_CONST];
|
||
|
char *string_buf_ptr;
|
||
|
|
||
|
|
||
|
\" string_buf_ptr = string_buf; BEGIN(str);
|
||
|
|
||
|
<str>\" { /* saw closing quote - all done */
|
||
|
BEGIN(INITIAL);
|
||
|
*string_buf_ptr = '\0';
|
||
|
/* return string constant token type and
|
||
|
* value to parser
|
||
|
*/
|
||
|
}
|
||
|
|
||
|
<str>\n {
|
||
|
/* error - unterminated string constant */
|
||
|
/* generate error message */
|
||
|
}
|
||
|
|
||
|
     <str>\\[0-7]{1,3} {
             /* octal escape sequence */
             unsigned int result;

             (void) sscanf( yytext + 1, "%o", &result );

             if ( result > 0xff )
                     {
                     /* error, constant is out-of-bounds */
                     }
             else
                     *string_buf_ptr++ = result;
             }
|
||
|
|
||
|
<str>\\[0-9]+ {
|
||
|
/* generate error - bad escape sequence; something
|
||
|
* like '\48' or '\0777777'
|
||
|
*/
|
||
|
}
|
||
|
|
||
|
<str>\\n *string_buf_ptr++ = '\n';
|
||
|
<str>\\t *string_buf_ptr++ = '\t';
|
||
|
<str>\\r *string_buf_ptr++ = '\r';
|
||
|
<str>\\b *string_buf_ptr++ = '\b';
|
||
|
<str>\\f *string_buf_ptr++ = '\f';
|
||
|
|
||
|
<str>\\(.|\n) *string_buf_ptr++ = yytext[1];
|
||
|
|
||
|
<str>[^\\\n\"]+ {
|
||
|
char *yptr = yytext;
|
||
|
|
||
|
while ( *yptr )
|
||
|
*string_buf_ptr++ = *yptr++;
|
||
|
}
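     The octal-escape rule above leaves its range check as a
     placeholder.  One way to do the decoding and the check in plain
     C is sketched below.  decode_octal_escape() is a hypothetical
     helper, not something flex provides; it expects the digits that
     follow the backslash (yytext + 1 in the rule above).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decodes the digits of an octal escape such as the "101" in "\101".
 * Returns the character value (0..255), or -1 if the text is not
 * one to three octal digits or the value does not fit in a byte.
 */
int decode_octal_escape(const char *digits)
{
    int value = 0;
    int n = 0;

    assert(digits != NULL);

    while (digits[n] >= '0' && digits[n] <= '7') {
        if (n == 3)
            return -1;            /* more than three digits */
        value = value * 8 + (digits[n] - '0');
        n++;
    }
    if (n == 0 || digits[n] != '\0')
        return -1;                /* empty, or a non-octal character */
    return value > 0xff ? -1 : value;
}
```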
|
||
|
|
||
|
|
||
|
Often, such as in some of the examples above, you wind up
|
||
|
writing a whole bunch of rules all preceded by the same
|
||
|
start condition(s). Flex makes this a little easier and
|
||
|
cleaner by introducing a notion of start condition scope. A
|
||
|
start condition scope is begun with:
|
||
|
|
||
|
<SCs>{
|
||
|
|
||
|
where SCs is a list of one or more start conditions. Inside
|
||
|
the start condition scope, every rule automatically has the
|
||
|
prefix <SCs> applied to it, until a '}' which matches the
|
||
|
initial '{'. So, for example,
|
||
|
|
||
|
<ESC>{
|
||
|
"\\n" return '\n';
|
||
|
"\\r" return '\r';
|
||
|
"\\f" return '\f';
|
||
|
"\\0" return '\0';
|
||
|
}
|
||
|
|
||
|
is equivalent to:
|
||
|
|
||
|
<ESC>"\\n" return '\n';
|
||
|
<ESC>"\\r" return '\r';
|
||
|
<ESC>"\\f" return '\f';
|
||
|
<ESC>"\\0" return '\0';
|
||
|
|
||
|
Start condition scopes may be nested.
|
||
|
|
||
|
Three routines are available for manipulating stacks of
|
||
|
start conditions:
|
||
|
|
||
|
void yy_push_state(int new_state)
|
||
|
pushes the current start condition onto the top of the
|
||
|
start condition stack and switches to new_state as
|
||
|
though you had used BEGIN new_state (recall that start
|
||
|
condition names are also integers).
|
||
|
|
||
|
void yy_pop_state()
|
||
|
pops the top of the stack and switches to it via BEGIN.
|
||
|
|
||
|
int yy_top_state()
|
||
|
returns the top of the stack without altering the
|
||
|
stack's contents.
|
||
|
|
||
|
The start condition stack grows dynamically and so has no
|
||
|
built-in size limitation. If memory is exhausted, program
|
||
|
execution aborts.
|
||
|
|
||
|
To use start condition stacks, your scanner must include a
|
||
|
%option stack directive (see Options below).
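     The behavior of the three stack routines can be modeled with a
     growable array.  The sketch below is not flex's implementation;
     struct sc_state and the sc_* functions are hypothetical names,
     and the doubling growth policy is an assumption made for the
     example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the scanner's start-condition state. */
struct sc_state {
    int    current;   /* what YY_START would report */
    int   *stack;     /* saved start conditions */
    size_t depth;     /* entries in use */
    size_t cap;       /* entries allocated */
};

/* Models yy_push_state(): save the current condition, then switch. */
void sc_push_state(struct sc_state *s, int new_state)
{
    assert(s != NULL);

    if (s->depth == s->cap) {
        size_t ncap = s->cap ? 2 * s->cap : 8;
        int *grown = realloc(s->stack, ncap * sizeof *grown);

        if (grown == NULL)
            abort();              /* memory exhausted: give up */
        s->stack = grown;
        s->cap = ncap;
    }
    s->stack[s->depth++] = s->current;
    s->current = new_state;       /* same effect as BEGIN(new_state) */
}

/* Models yy_pop_state(): switch back to the most recently saved one. */
void sc_pop_state(struct sc_state *s)
{
    assert(s != NULL && s->depth > 0);
    s->current = s->stack[--s->depth];
}

/* Models yy_top_state(): look at the top without changing the stack. */
int sc_top_state(const struct sc_state *s)
{
    assert(s != NULL && s->depth > 0);
    return s->stack[s->depth - 1];
}
```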
|
||
|
|
||
|
MULTIPLE INPUT BUFFERS
|
||
|
Some scanners (such as those which support "include" files)
|
||
|
require reading from several input streams. As flex
|
||
|
scanners do a large amount of buffering, one cannot control
|
||
|
where the next input will be read from by simply writing a
|
||
|
YY_INPUT which is sensitive to the scanning context.
|
||
|
YY_INPUT is only called when the scanner reaches the end of
|
||
|
its buffer, which may be a long time after scanning a state-
|
||
|
ment such as an "include" which requires switching the input
|
||
|
source.
|
||
|
|
||
|
To negotiate these sorts of problems, flex provides a
|
||
|
mechanism for creating and switching between multiple input
|
||
|
buffers. An input buffer is created by using:
|
||
|
|
||
|
YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
|
||
|
|
||
|
which takes a FILE pointer and a size and creates a buffer
|
||
|
associated with the given file and large enough to hold size
|
||
|
characters (when in doubt, use YY_BUF_SIZE for the size).
|
||
|
It returns a YY_BUFFER_STATE handle, which may then be
|
||
|
passed to other routines (see below). The YY_BUFFER_STATE
|
||
|
type is a pointer to an opaque struct yy_buffer_state struc-
|
||
|
ture, so you may safely initialize YY_BUFFER_STATE variables
|
||
|
to ((YY_BUFFER_STATE) 0) if you wish, and also refer to the
|
||
|
opaque structure in order to correctly declare input buffers
|
||
|
in source files other than that of your scanner. Note that
|
||
|
the FILE pointer in the call to yy_create_buffer is only
|
||
|
used as the value of yyin seen by YY_INPUT; if you redefine
|
||
|
YY_INPUT so it no longer uses yyin, then you can safely pass
|
||
|
a nil FILE pointer to yy_create_buffer. You select a partic-
|
||
|
ular buffer to scan from using:
|
||
|
|
||
|
void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
|
||
|
|
||
|
switches the scanner's input buffer so subsequent tokens
|
||
|
will come from new_buffer. Note that yy_switch_to_buffer()
|
||
|
may be used by yywrap() to set things up for continued scan-
|
||
|
ning, instead of opening a new file and pointing yyin at it.
|
||
|
Note also that switching input sources via either
|
||
|
yy_switch_to_buffer() or yywrap() does not change the start
|
||
|
condition.
|
||
|
|
||
|
void yy_delete_buffer( YY_BUFFER_STATE buffer )
|
||
|
|
||
|
     is used to reclaim the storage associated with a buffer.
     (buffer can be nil, in which case the routine does nothing.)
|
||
|
You can also clear the current contents of a buffer using:
|
||
|
|
||
|
void yy_flush_buffer( YY_BUFFER_STATE buffer )
|
||
|
|
||
|
This function discards the buffer's contents, so the next
|
||
|
time the scanner attempts to match a token from the buffer,
|
||
|
it will first fill the buffer anew using YY_INPUT.
|
||
|
|
||
|
yy_new_buffer() is an alias for yy_create_buffer(), provided
|
||
|
for compatibility with the C++ use of new and delete for
|
||
|
creating and destroying dynamic objects.
|
||
|
|
||
|
Finally, the YY_CURRENT_BUFFER macro returns a
|
||
|
YY_BUFFER_STATE handle to the current buffer.
|
||
|
|
||
|
Here is an example of using these features for writing a
|
||
|
scanner which expands include files (the <<EOF>> feature is
|
||
|
discussed below):
|
||
|
|
||
|
/* the "incl" state is used for picking up the name
|
||
|
* of an include file
|
||
|
*/
|
||
|
%x incl
|
||
|
|
||
|
%{
|
||
|
#define MAX_INCLUDE_DEPTH 10
|
||
|
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
|
||
|
int include_stack_ptr = 0;
|
||
|
%}
|
||
|
|
||
|
%%
|
||
|
include BEGIN(incl);
|
||
|
|
||
|
[a-z]+ ECHO;
|
||
|
[^a-z\n]*\n? ECHO;
|
||
|
|
||
|
<incl>[ \t]* /* eat the whitespace */
|
||
|
<incl>[^ \t\n]+ { /* got the include file name */
|
||
|
if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
|
||
|
{
|
||
|
                     fprintf( stderr, "Includes nested too deeply\n" );
|
||
|
exit( 1 );
|
||
|
}
|
||
|
|
||
|
include_stack[include_stack_ptr++] =
|
||
|
YY_CURRENT_BUFFER;
|
||
|
|
||
|
yyin = fopen( yytext, "r" );
|
||
|
|
||
|
if ( ! yyin )
|
||
|
error( ... );
|
||
|
|
||
|
yy_switch_to_buffer(
|
||
|
yy_create_buffer( yyin, YY_BUF_SIZE ) );
|
||
|
|
||
|
BEGIN(INITIAL);
|
||
|
}
|
||
|
|
||
|
<<EOF>> {
|
||
|
if ( --include_stack_ptr < 0 )
|
||
|
{
|
||
|
yyterminate();
|
||
|
}
|
||
|
|
||
|
else
|
||
|
{
|
||
|
yy_delete_buffer( YY_CURRENT_BUFFER );
|
||
|
yy_switch_to_buffer(
|
||
|
include_stack[include_stack_ptr] );
|
||
|
}
|
||
|
}
|
||
|
|
||
|
Three routines are available for setting up input buffers
|
||
|
for scanning in-memory strings instead of files. All of
|
||
|
them create a new input buffer for scanning the string, and
|
||
|
return a corresponding YY_BUFFER_STATE handle (which you
|
||
|
should delete with yy_delete_buffer() when done with it).
|
||
|
They also switch to the new buffer using
|
||
|
yy_switch_to_buffer(), so the next call to yylex() will
|
||
|
start scanning the string.
|
||
|
|
||
|
yy_scan_string(const char *str)
|
||
|
scans a NUL-terminated string.
|
||
|
|
||
|
yy_scan_bytes(const char *bytes, int len)
|
||
|
scans len bytes (including possibly NUL's) starting at
|
||
|
location bytes.
|
||
|
|
||
|
Note that both of these functions create and scan a copy of
|
||
|
the string or bytes. (This may be desirable, since yylex()
|
||
|
modifies the contents of the buffer it is scanning.) You
|
||
|
can avoid the copy by using:
|
||
|
|
||
|
yy_scan_buffer(char *base, yy_size_t size)
|
||
|
          which scans in place the buffer starting at base, consisting
          of size bytes, the last two bytes of which must be
          YY_END_OF_BUFFER_CHAR (ASCII NUL).  These last two bytes are
          not scanned; thus, scanning consists of base[0] through
          base[size-3], inclusive.
|
||
|
|
||
|
If you fail to set up base in this manner (i.e., forget
|
||
|
the final two YY_END_OF_BUFFER_CHAR bytes), then
|
||
|
yy_scan_buffer() returns a nil pointer instead of
|
||
|
creating a new input buffer.
|
||
|
|
||
|
The type yy_size_t is an integral type to which you can
|
||
|
cast an integer expression reflecting the size of the
|
||
|
buffer.
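     Preparing a buffer in the layout that yy_scan_buffer() expects
     is mostly a matter of reserving two extra bytes.  The helper
     below is a sketch under that reading of the requirements above;
     make_scan_buffer() is a hypothetical name, not part of flex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a malloc()ed copy of text followed by the two NUL bytes
 * that yy_scan_buffer() requires, and stores the total size
 * (strlen(text) + 2) in *size_out.  Returns NULL if malloc() fails.
 * The caller frees the result.
 */
char *make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len;
    char *base;

    assert(text != NULL && size_out != NULL);

    len = strlen(text);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = '\0';       /* first end-of-buffer byte */
    base[len + 1] = '\0';   /* second end-of-buffer byte */
    *size_out = len + 2;
    return base;
}
```

     In a scanner, the result would be passed on as
     yy_scan_buffer( base, (yy_size_t) size ).  Because the buffer is
     scanned in place, it should stay allocated until scanning of it
     has finished.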
|
||
|
|
||
|
END-OF-FILE RULES
|
||
|
The special rule "<<EOF>>" indicates actions which are to be
|
||
|
taken when an end-of-file is encountered and yywrap()
|
||
|
returns non-zero (i.e., indicates no further files to pro-
|
||
|
cess). The action must finish by doing one of four things:
|
||
|
|
||
|
- assigning yyin to a new input file (in previous ver-
|
||
|
sions of flex, after doing the assignment you had to
|
||
|
call the special action YY_NEW_FILE; this is no longer
|
||
|
necessary);
|
||
|
|
||
|
- executing a return statement;
|
||
|
|
||
|
- executing the special yyterminate() action;
|
||
|
|
||
|
- or, switching to a new buffer using
|
||
|
yy_switch_to_buffer() as shown in the example above.
|
||
|
|
||
|
<<EOF>> rules may not be used with other patterns; they may
|
||
|
only be qualified with a list of start conditions. If an
|
||
|
unqualified <<EOF>> rule is given, it applies to all start
|
||
|
conditions which do not already have <<EOF>> actions. To
|
||
|
specify an <<EOF>> rule for only the initial start condi-
|
||
|
tion, use
|
||
|
|
||
|
<INITIAL><<EOF>>
|
||
|
|
||
|
|
||
|
These rules are useful for catching things like unclosed
|
||
|
comments. An example:
|
||
|
|
||
|
%x quote
|
||
|
%%
|
||
|
|
||
|
...other rules for dealing with quotes...
|
||
|
|
||
|
<quote><<EOF>> {
|
||
|
error( "unterminated quote" );
|
||
|
yyterminate();
|
||
|
}
|
||
|
<<EOF>> {
|
||
|
if ( *++filelist )
|
||
|
yyin = fopen( *filelist, "r" );
|
||
|
else
|
||
|
yyterminate();
|
||
|
}
|
||
|
|
||
|
|
||
|
MISCELLANEOUS MACROS
|
||
|
The macro YY_USER_ACTION can be defined to provide an action
|
||
|
which is always executed prior to the matched rule's action.
|
||
|
For example, it could be #define'd to call a routine to con-
|
||
|
vert yytext to lower-case. When YY_USER_ACTION is invoked,
|
||
|
the variable yy_act gives the number of the matched rule
|
||
|
(rules are numbered starting with 1). Suppose you want to
|
||
|
profile how often each of your rules is matched. The fol-
|
||
|
lowing would do the trick:
|
||
|
|
||
|
#define YY_USER_ACTION ++ctr[yy_act]
|
||
|
|
||
|
where ctr is an array to hold the counts for the different
|
||
|
rules. Note that the macro YY_NUM_RULES gives the total
|
||
|
number of rules (including the default rule, even if you use
|
||
|
-s), so a correct declaration for ctr is:
|
||
|
|
||
|
int ctr[YY_NUM_RULES];
|
||
|
|
||
|
|
||
|
The macro YY_USER_INIT may be defined to provide an action
|
||
|
which is always executed before the first scan (and before
|
||
|
the scanner's internal initializations are done). For exam-
|
||
|
ple, it could be used to call a routine to read in a data
|
||
|
table or open a logging file.
|
||
|
|
||
|
The macro yy_set_interactive(is_interactive) can be used to
|
||
|
control whether the current buffer is considered interac-
|
||
|
tive. An interactive buffer is processed more slowly, but
|
||
|
must be used when the scanner's input source is indeed
|
||
|
interactive to avoid problems due to waiting to fill buffers
|
||
|
(see the discussion of the -I flag below). A non-zero value
|
||
|
in the macro invocation marks the buffer as interactive, a
|
||
|
zero value as non-interactive. Note that use of this macro
|
||
|
overrides %option always-interactive or %option never-
|
||
|
interactive (see Options below). yy_set_interactive() must
|
||
|
be invoked prior to beginning to scan the buffer that is (or
|
||
|
is not) to be considered interactive.
|
||
|
|
||
|
     The macro yy_set_bol(at_bol) can be used to control whether the
     current buffer's scanning context for the next token match is
     done as though at the beginning of a line.  A non-zero macro
     argument makes rules anchored with '^' active, while a zero
     argument makes '^' rules inactive.
|
||
|
|
||
|
The macro YY_AT_BOL() returns true if the next token scanned
|
||
|
from the current buffer will have '^' rules active, false
|
||
|
otherwise.
|
||
|
|
||
|
     In the generated scanner, the actions are all gathered in one
     large switch statement and separated using YY_BREAK, which may
     be redefined.  By default, it is simply a "break", to separate
     each rule's action from the following rule's.  Redefining
     YY_BREAK allows, for example, C++ users to #define YY_BREAK to
     do nothing (while being very careful that every rule ends with a
     "break" or a "return"!) to avoid unreachable-statement warnings:
     when a rule's action ends with "return", the YY_BREAK that
     follows it can never be reached.
|
||
|
|
||
|
VALUES AVAILABLE TO THE USER
|
||
|
This section summarizes the various values available to the
|
||
|
user in the rule actions.
|
||
|
|
||
|
- char *yytext holds the text of the current token. It
|
||
|
may be modified but not lengthened (you cannot append
|
||
|
characters to the end).
|
||
|
|
||
|
If the special directive %array appears in the first
|
||
|
section of the scanner description, then yytext is
|
||
|
instead declared char yytext[YYLMAX], where YYLMAX is a
|
||
|
macro definition that you can redefine in the first
|
||
|
section if you don't like the default value (generally
|
||
|
8KB). Using %array results in somewhat slower
|
||
|
scanners, but the value of yytext becomes immune to
|
||
|
calls to input() and unput(), which potentially destroy
|
||
|
its value when yytext is a character pointer. The
|
||
|
opposite of %array is %pointer, which is the default.
|
||
|
|
||
|
You cannot use %array when generating C++ scanner
|
||
|
classes (the -+ flag).
|
||
|
|
||
|
- int yyleng holds the length of the current token.
|
||
|
|
||
|
- FILE *yyin is the file which by default flex reads
|
||
|
from. It may be redefined but doing so only makes
|
||
|
sense before scanning begins or after an EOF has been
|
||
|
encountered. Changing it in the midst of scanning will
|
||
|
have unexpected results since flex buffers its input;
|
||
|
        use yyrestart() instead.  Once scanning terminates because
        an end-of-file has been seen, you can point yyin at the
        new input file and then call the scanner again to continue
        scanning.
|
||
|
|
||
|
- void yyrestart( FILE *new_file ) may be called to point
|
||
|
yyin at the new input file. The switch-over to the new
|
||
|
file is immediate (any previously buffered-up input is
|
||
|
lost). Note that calling yyrestart() with yyin as an
|
||
|
argument thus throws away the current input buffer and
|
||
|
continues scanning the same input file.
|
||
|
|
||
|
- FILE *yyout is the file to which ECHO actions are done.
|
||
|
It can be reassigned by the user.
|
||
|
|
||
|
- YY_CURRENT_BUFFER returns a YY_BUFFER_STATE handle to
|
||
|
the current buffer.
|
||
|
|
||
|
- YY_START returns an integer value corresponding to the
|
||
|
current start condition. You can subsequently use this
|
||
|
value with BEGIN to return to that start condition.
|
||
|
|
||
|
INTERFACING WITH YACC
|
||
|
One of the main uses of flex is as a companion to the yacc
|
||
|
parser-generator. yacc parsers expect to call a routine
|
||
|
named yylex() to find the next input token. The routine is
|
||
|
supposed to return the type of the next token as well as
|
||
|
putting any associated value in the global yylval. To use
|
||
|
flex with yacc, one specifies the -d option to yacc to
|
||
|
instruct it to generate the file y.tab.h containing defini-
|
||
|
tions of all the %tokens appearing in the yacc input. This
|
||
|
file is then included in the flex scanner. For example, if
|
||
|
one of the tokens is "TOK_NUMBER", part of the scanner might
|
||
|
look like:
|
||
|
|
||
|
%{
|
||
|
#include "y.tab.h"
|
||
|
%}
|
||
|
|
||
|
%%
|
||
|
|
||
|
[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
|
||
|
|
||
|
|
||
|
OPTIONS
|
||
|
flex has the following options:
|
||
|
|
||
|
-b Generate backing-up information to lex.backup. This is
|
||
|
a list of scanner states which require backing up and
|
||
|
the input characters on which they do so. By adding
|
||
|
rules one can remove backing-up states. If all
|
||
|
backing-up states are eliminated and -Cf or -CF is
|
||
|
used, the generated scanner will run faster (see the -p
|
||
|
flag). Only users who wish to squeeze every last cycle
|
||
|
out of their scanners need worry about this option.
|
||
|
(See the section on Performance Considerations below.)
|
||
|
|
||
|
-c is a do-nothing, deprecated option included for POSIX
|
||
|
compliance.
|
||
|
|
||
|
-d   makes the generated scanner run in debug mode. When-
     ever a pattern is recognized and the global
     yy_flex_debug is non-zero (which is the default), the
     scanner will write to stderr a line of the form:

         --accepting rule at line 53 ("the matched text")

     The line number refers to the location of the rule in
     the file defining the scanner (i.e., the file that was
     fed to flex). Messages are also generated when the
     scanner backs up, accepts the default rule, reaches the
     end of its input buffer (or encounters a NUL; at this
     point, the two look the same as far as the scanner's
     concerned), or reaches an end-of-file.

-f   specifies fast scanner. No table compression is done
     and stdio is bypassed. The result is large but fast.
     This option is equivalent to -Cfr (see below).

-h   generates a "help" summary of flex's options to stdout
     and then exits. -? and --help are synonyms for -h.

-i   instructs flex to generate a case-insensitive scanner.
     The case of letters given in the flex input patterns
     will be ignored, and tokens in the input will be
     matched regardless of case. The matched text given in
     yytext will have the preserved case (i.e., it will not
     be folded).

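The behavior described for -i can be pictured with a small C
sketch. This is not how flex implements the option (flex folds
case inside its generated tables); the function name
match_ignoring_case is hypothetical and only illustrates that
the comparison ignores case while the input text itself is left
untouched, as yytext is.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the length of pattern if input starts
 * with it when case is ignored, and 0 otherwise.  Only folded copies
 * of the characters are compared; the input buffer is never modified.
 */
size_t match_ignoring_case(const char *input, const char *pattern)
{
    size_t i = 0;

    while (pattern[i] != '\0') {
        if (tolower((unsigned char)input[i]) !=
            tolower((unsigned char)pattern[i]))
            return 0;
        i++;
    }
    return i;
}
```

Called with the input "BeGiN x" and the pattern "begin", the
helper reports a five-character match, and the caller still sees
the original spelling "BeGiN" in its buffer.
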
-l   turns on maximum compatibility with the original AT&T
     lex implementation. Note that this does not mean full
     compatibility. Use of this option costs a considerable
     amount of performance, and it cannot be used with the
     -+, -f, -F, -Cf, or -CF options. For details on the
     compatibilities it provides, see the section "Incompa-
     tibilities With Lex And POSIX" below. This option also
     results in the name YY_FLEX_LEX_COMPAT being #define'd
     in the generated scanner.

-n   is another do-nothing, deprecated option included only
     for POSIX compliance.

-p   generates a performance report to stderr. The report
     consists of comments regarding features of the flex
     input file which will cause a serious loss of perfor-
     mance in the resulting scanner. If you give the flag
     twice, you will also get comments regarding features
     that lead to minor performance losses.

     Note that the use of REJECT, %option yylineno, and
     variable trailing context (see the Deficiencies / Bugs
     section below) entails a substantial performance
     penalty; use of yymore(), the ^ operator, and the -I
     flag entail minor performance penalties.

-s   causes the default rule (that unmatched scanner input
     is echoed to stdout) to be suppressed. If the scanner
     encounters input that does not match any of its rules,
     it aborts with an error. This option is useful for
     finding holes in a scanner's rule set.

-t   instructs flex to write the scanner it generates to
     standard output instead of lex.yy.c.

-v   specifies that flex should write to stderr a summary of
     statistics regarding the scanner it generates. Most of
     the statistics are meaningless to the casual flex user,
     but the first line identifies the version of flex (same
     as reported by -V), and the next line the flags used
     when generating the scanner, including those that are
     on by default.

-w   suppresses warning messages.

-B   instructs flex to generate a batch scanner, the oppo-
     site of interactive scanners generated by -I (see
     below). In general, you use -B when you are certain
     that your scanner will never be used interactively, and
     you want to squeeze a little more performance out of
     it. If your goal is instead to squeeze out a lot more
     performance, you should be using the -Cf or -CF
     options (discussed below), which turn on -B automati-
     cally anyway.

-F   specifies that the fast scanner table representation
     should be used (and stdio bypassed). This representa-
     tion is about as fast as the full table representation
     (-f), and for some sets of patterns will be consider-
     ably smaller (and for others, larger). In general, if
     the pattern set contains both "keywords" and a catch-
     all, "identifier" rule, such as in the set:

         "case"    return TOK_CASE;
         "switch"  return TOK_SWITCH;
         ...
         "default" return TOK_DEFAULT;
         [a-z]+    return TOK_ID;

     then you're better off using the full table representa-
     tion. If only the "identifier" rule is present and you
     then use a hash table or some such to detect the key-
     words, you're better off using -F.

     This option is equivalent to -CFr (see below). It can-
     not be used with -+.

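The "identifier rule plus lookup" arrangement mentioned under -F
might look like the following C sketch. It uses a sorted table
with bsearch() rather than a hash table, purely to stay short;
the token values and the function name classify_word are
assumptions made for illustration (a real scanner would take its
token codes from y.tab.h).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; normally these come from y.tab.h. */
enum { TOK_ID = 1, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() requires sorted input. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key,
                  ((const struct keyword *)elem)->name);
}

/* Classify text that the single [a-z]+ rule has already matched. */
int classify_word(const char *text)
{
    const struct keyword *kw =
        bsearch(text, keywords, sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);

    return kw ? kw->token : TOK_ID;
}
```

The action for the identifier rule would then reduce to something
like "return classify_word( yytext );", so the scanner tables only
need to describe one pattern.
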
-I   instructs flex to generate an interactive scanner. An
     interactive scanner is one that only looks ahead to
     decide what token has been matched if it absolutely
     must. It turns out that always looking one extra char-
     acter ahead, even if the scanner has already seen
     enough text to disambiguate the current token, is a bit
     faster than only looking ahead when necessary. But
     scanners that always look ahead give dreadful interac-
     tive performance; for example, when a user types a new-
     line, it is not recognized as a newline token until
     they enter another token, which often means typing in
     another whole line.

     Flex scanners default to interactive unless you use the
     -Cf or -CF table-compression options (see below).
     That's because if you're looking for high performance
     you should be using one of these options, so if you
     didn't, flex assumes you'd rather trade off a bit of
     run-time performance for intuitive interactive
     behavior. Note also that you cannot use -I in conjunc-
     tion with -Cf or -CF. Thus, this option is not really
     needed; it is on by default for all those cases in
     which it is allowed.

     You can force a scanner to not be interactive by using
     -B (see above).

-L   instructs flex not to generate #line directives.
     Without this option, flex peppers the generated scanner
     with #line directives so error messages in the actions
     will be correctly located with respect to either the
     original flex input file (if the errors are due to code
     in the input file), or lex.yy.c (if the errors are
     flex's fault -- you should report these sorts of errors
     to the email address given below).

-T   makes flex run in trace mode. It will generate a lot
     of messages to stderr concerning the form of the input
     and the resultant non-deterministic and deterministic
     finite automata. This option is mostly for use in
     maintaining flex.

-V   prints the version number to stdout and exits. --ver-
     sion is a synonym for -V.

-7   instructs flex to generate a 7-bit scanner, i.e., one
     which can only recognize 7-bit characters in its
     input. The advantage of using -7 is that the scanner's
     tables can be up to half the size of those generated
     using the -8 option (see below). The disadvantage is
     that such scanners often hang or crash if their input
     contains an 8-bit character.

     Note, however, that unless you generate your scanner
     using the -Cf or -CF table compression options, use of
     -7 will save only a small amount of table space, and
     make your scanner considerably less portable. Flex's
     default behavior is to generate an 8-bit scanner unless
     you use -Cf or -CF, in which case flex defaults to
     generating 7-bit scanners unless your site was always
     configured to generate 8-bit scanners (as will often be
     the case with non-USA sites). You can tell whether
     flex generated a 7-bit or an 8-bit scanner by inspect-
     ing the flag summary in the -v output as described
     above.

     Note that if you use -Cfe or -CFe (those table compres-
     sion options, but also using equivalence classes as
     discussed below), flex still defaults to generating
     an 8-bit scanner, since usually with these compression
     options full 8-bit tables are not much more expensive
     than 7-bit tables.

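The "up to half the size" figure for -7 follows from simple
arithmetic on an uncompressed table, sketched below in C. The
layout assumed here (one entry for every state and every input
character) and the two-byte entry size used in the example are
illustrative assumptions, not a statement of what flex emits; the
function name full_table_bytes is hypothetical.

```c
#include <stddef.h>

/*
 * Size in bytes of an uncompressed transition table under the
 * assumed layout: one entry per (state, input character) pair.
 */
size_t full_table_bytes(size_t num_states, size_t num_chars,
                        size_t entry_size)
{
    return num_states * num_chars * entry_size;
}
```

For a hypothetical scanner with 100 states and two-byte entries,
the 7-bit table (128 characters) needs 25600 bytes and the 8-bit
table (256 characters) needs 51200 bytes. With compressed tables
or equivalence classes the column count no longer tracks the
character-set size so directly, which is why the savings shrink.
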
-8   instructs flex to generate an 8-bit scanner, i.e., one
     which can recognize 8-bit characters. This flag is
     only needed for scanners generated using -Cf or -CF, as
     otherwise flex defaults to generating an 8-bit scanner
     anyway.

     See the discussion of -7 above for flex's default
     behavior and the tradeoffs between 7-bit and 8-bit
     scanners.

-+   specifies that you want flex to generate a C++ scanner
     class. See the section on Generating C++ Scanners
     below for details.

-C[aefFmr]
     controls the degree of table compression and, more gen-
     erally, trade-offs between small scanners and fast
     scanners.

     -Ca ("align") instructs flex to trade off larger tables
     in the generated scanner for faster performance because
     the elements of the tables are better aligned for
     memory access and computation. On some RISC architec-
     tures, fetching and manipulating longwords is more
     efficient than with smaller-sized units such as short-
     words. This option can double the size of the tables
     used by your scanner.

     -Ce directs flex to construct equivalence classes,
     i.e., sets of characters which have identical lexical
     properties (for example, if the only appearance of
     digits in the flex input is in the character class
     "[0-9]" then the digits '0', '1', ..., '9' will all be
     put in the same equivalence class). Equivalence
     classes usually give dramatic reductions in the final
     table/object file sizes (typically a factor of 2-5) and
     are pretty cheap performance-wise (one array look-up
     per character scanned).

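The idea behind equivalence classes can be sketched in C for a
scanner whose only pattern is [0-9]+. This is a hand-written
illustration, not flex output: the names ec, build_classes and
match_number are hypothetical, and the class table is filled in
at run time here whereas a generated scanner would carry it
precomputed. The point is that the transition table needs only
two columns (digit, anything else) instead of one per character,
at the price of one extra array look-up per character.

```c
#include <stddef.h>

enum { EC_OTHER = 0, EC_DIGIT = 1, NUM_CLASSES = 2 };

static unsigned char ec[256];   /* character -> equivalence class */

/* All ten digits share one class; everything else stays EC_OTHER. */
static void build_classes(void)
{
    int c;

    for (c = '0'; c <= '9'; c++)
        ec[c] = EC_DIGIT;
}

/* next_state[state][class]: 0 = start, 1 = in a number, -1 = jam. */
static const int next_state[2][NUM_CLASSES] = {
    { -1, 1 },
    { -1, 1 },
};

/* Length of the longest prefix of s matching [0-9]+ (0 if none). */
size_t match_number(const char *s)
{
    int state = 0;
    size_t len = 0;

    build_classes();
    while (s[len] != '\0') {
        int next = next_state[state][ec[(unsigned char)s[len]]];

        if (next < 0)
            break;
        state = next;
        len++;
    }
    return len;
}
```
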
     -Cf specifies that the full scanner tables should be
     generated - flex should not compress the tables by tak-
     ing advantage of similar transition functions for dif-
     ferent states.

     -CF specifies that the alternate fast scanner represen-
     tation (described above under the -F flag) should be
     used. This option cannot be used with -+.

     -Cm directs flex to construct meta-equivalence classes,
     which are sets of equivalence classes (or characters,
     if equivalence classes are not being used) that are
     commonly used together. Meta-equivalence classes are
     often a big win when using compressed tables, but they
     have a moderate performance impact (one or two "if"
     tests and one array look-up per character scanned).

     -Cr causes the generated scanner to bypass use of the
     standard I/O library (stdio) for input. Instead of
     calling fread() or getc(), the scanner will use the
     read() system call, resulting in a performance gain
     which varies from system to system, but in general is
     probably negligible unless you are also using -Cf or
     -CF. Using -Cr can cause strange behavior if, for exam-
     ple, you read from yyin using stdio prior to calling
     the scanner (because the scanner will miss whatever
     text your previous reads left in the stdio input
     buffer).

     -Cr has no effect if you define YY_INPUT (see The Gen-
     erated Scanner above).

     A lone -C specifies that the scanner tables should be
     compressed but neither equivalence classes nor meta-
     equivalence classes should be used.

     The options -Cf or -CF and -Cm do not make sense
     together - there is no opportunity for meta-equivalence
     classes if the table is not being compressed. Other-
     wise the options may be freely mixed, and are cumula-
     tive.

     The default setting is -Cem, which specifies that flex
     should generate equivalence classes and meta-
     equivalence classes. This setting provides the highest
     degree of table compression. You can trade off
     faster-executing scanners at the cost of larger tables
     with the following generally being true:

         slowest & smallest
               -Cem
               -Cm
               -Ce
               -C
               -C{f,F}e
               -C{f,F}
               -C{f,F}a
         fastest & largest

     Note that scanners with the smallest tables are usually
     generated and compiled the quickest, so during develop-
     ment you will usually want to use the default, maximal
     compression.

     -Cfe is often a good compromise between speed and size
     for production scanners.

-ooutput
     directs flex to write the scanner to the file output
     instead of lex.yy.c. If you combine -o with the -t
     option, then the scanner is written to stdout but its
     #line directives (see the -L option above) refer to the
     file output.

-Pprefix
     changes the default yy prefix used by flex for all
     globally-visible variable and function names to instead
     be prefix. For example, -Pfoo changes the name of
     yytext to footext. It also changes the name of the
     default output file from lex.yy.c to lex.foo.c. Here
     are all of the names affected:

         yy_create_buffer
         yy_delete_buffer
         yy_flex_debug
         yy_init_buffer
         yy_flush_buffer
         yy_load_buffer_state
         yy_switch_to_buffer
         yyin
         yyleng
         yylex
         yylineno
         yyout
         yyrestart
         yytext
         yywrap

     (If you are using a C++ scanner, then only yywrap and
     yyFlexLexer are affected.) Within your scanner itself,
     you can still refer to the global variables and func-
     tions using either version of their name; but exter-
     nally, they have the modified name.

     This option lets you easily link together multiple flex
     programs into the same executable. Note, though, that
     using this option also renames yywrap(), so you now
     must either provide your own (appropriately-named) ver-
     sion of the routine for your scanner, or use %option
     noyywrap, as linking with -lfl no longer provides one
     for you by default.

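As a minimal sketch of the "appropriately-named" routine, and
assuming the scanner was generated with -Pfoo as in the example
above, the replacement for yywrap() could be as small as this:

```c
/*
 * Assumes the scanner was generated with -Pfoo, so it calls foowrap()
 * where an unprefixed scanner would call yywrap().
 */
int foowrap(void)
{
    return 1;   /* non-zero: no further input to scan, stop at EOF */
}
```

A second scanner linked into the same executable with, say,
-Pbar would need its own barwrap() (or %option noyywrap) in the
same way.
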
-Sskeleton_file
     overrides the default skeleton file from which flex
     constructs its scanners. You'll never need this option
     unless you are doing flex maintenance or development.

flex also provides a mechanism for controlling options
within the scanner specification itself, rather than from
the flex command-line. This is done by including %option
directives in the first section of the scanner specifica-
tion. You can specify multiple options with a single
%option directive, and multiple directives in the first sec-
tion of your flex input file.

Most options are given simply as names, optionally preceded
by the word "no" (with no intervening whitespace) to negate
their meaning. A number are equivalent to flex flags or
their negation:

    7bit            -7 option
    8bit            -8 option
    align           -Ca option
    backup          -b option
    batch           -B option
    c++             -+ option

    caseful or
    case-sensitive  opposite of -i (default)

    case-insensitive or
    caseless        -i option

    debug           -d option
    default         opposite of -s option
    ecs             -Ce option
    fast            -F option
    full            -f option
    interactive     -I option
    lex-compat      -l option
    meta-ecs        -Cm option
    perf-report     -p option
    read            -Cr option
    stdout          -t option
    verbose         -v option
    warn            opposite of -w option
                    (use "%option nowarn" for -w)

    array           equivalent to "%array"
    pointer         equivalent to "%pointer" (default)

Some %option's provide features otherwise not available:

always-interactive
     instructs flex to generate a scanner which always con-
     siders its input "interactive". Normally, on each new
     input file the scanner calls isatty() in an attempt to
     determine whether the scanner's input source is
     interactive and thus should be read a character at a
     time. When this option is used, however, then no such
     call is made.

main directs flex to provide a default main() program for
     the scanner, which simply calls yylex(). This option
     implies noyywrap (see below).

never-interactive
     instructs flex to generate a scanner which never con-
     siders its input "interactive" (again, no call made to
     isatty()). This is the opposite of always-interactive.

stack
     enables the use of start condition stacks (see Start
     Conditions above).

stdinit
     if set (i.e., %option stdinit) initializes yyin and
     yyout to stdin and stdout, instead of the default of
     nil. Some existing lex programs depend on this
     behavior, even though it is not compliant with ANSI C,
     which does not require stdin and stdout to be compile-
     time constant.

yylineno
     directs flex to generate a scanner that maintains the
     number of the current line read from its input in the
     global variable yylineno. This option is implied by
     %option lex-compat.

yywrap
     if unset (i.e., %option noyywrap), makes the scanner
     not call yywrap() upon an end-of-file, but simply
     assume that there are no more files to scan (until the
     user points yyin at a new file and calls yylex()
     again).

flex scans your rule actions to determine whether you use
the REJECT or yymore() features. The reject and yymore
options are available to override its decision as to whether
you use these features, either by setting them (e.g., %option
reject) to indicate the feature is indeed used, or unsetting
them to indicate it actually is not used (e.g., %option
noyymore).

Three options take string-delimited values, offset with '=':

    %option outfile="ABC"

is equivalent to -oABC, and

    %option prefix="XYZ"

is equivalent to -PXYZ. Finally,

    %option yyclass="foo"

only applies when generating a C++ scanner (-+ option). It
informs flex that you have derived foo as a subclass of
yyFlexLexer, so flex will place your actions in the member
function foo::yylex() instead of yyFlexLexer::yylex(). It
also generates a yyFlexLexer::yylex() member function that
emits a run-time error (by invoking
yyFlexLexer::LexerError()) if called. See Generating C++
Scanners, below, for additional information.

A number of options are available for lint purists who want
to suppress the appearance of unneeded routines in the gen-
erated scanner. Each of the following, if unset (e.g.,
%option nounput), results in the corresponding routine not
appearing in the generated scanner:

    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

(though yy_push_state() and friends won't appear anyway
unless you use %option stack).

PERFORMANCE CONSIDERATIONS
The main design goal of flex is that it generate high-
performance scanners. It has been optimized for dealing
well with large sets of rules. Aside from the effects on
scanner speed of the table compression -C options outlined
above, there are a number of options/actions which degrade
performance. These are, from most expensive to least:

    REJECT
    %option yylineno
    arbitrary trailing context

    pattern sets that require backing up
    %array
    %option interactive
    %option always-interactive

    '^' beginning-of-line operator
    yymore()

with the first three all being quite expensive and the last
two being quite cheap. Note also that unput() is imple-
mented as a routine call that potentially does quite a bit
of work, while yyless() is a quite-cheap macro; so if just
putting back some excess text you scanned, use yyless().

REJECT should be avoided at all costs when performance is
important. It is a particularly expensive option.

Getting rid of backing up is messy and often may be an enor-
mous amount of work for a complicated scanner. In princi-
ple, one begins by using the -b flag to generate a
lex.backup file. For example, on the input

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

the file looks like:

    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always back up.

The first few lines tell us that there's a scanner state in
which it can make a transition on an 'o' but not on any
other character, and that in that state the currently
scanned text does not match any rule. The state occurs when
trying to match the rules found at lines 2 and 3 in the
input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find
a rule which is matched. With a bit of headscratching one
can see that this must be the state it's in when it has seen
"fo". When this has happened, if anything other than
another 'o' is seen, the scanner will have to back up to
simply match the 'f' (by the default rule).

The comment regarding State #8 indicates there's a problem
when "foob" has been scanned. Indeed, on any character
other than an 'a', the scanner will have to back up to
accept "foo". Similarly, the comment for State #9 concerns
when "fooba" has been scanned and an 'r' does not follow.

The final comment reminds us that there's no point going to
all the trouble of removing backing up from the rules unless
we're using -Cf or -CF, since there's no performance gain
doing so with compressed scanners.

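The three situations described above can be replayed with a small
hand-written C model of this two-rule scanner. It is only a
sketch of the longest-match bookkeeping, not the code flex
generates; the names struct scan_result and scan_token are
hypothetical. The model remembers the length at the last
accepting position and reports how many already-read characters
have to be given back when the scan jams.

```c
#include <stddef.h>

struct scan_result {
    size_t length;      /* length of the token finally accepted       */
    size_t backed_up;   /* characters read beyond it and handed back  */
};

/* Model of the rules "foo" and "foobar" plus the default rule. */
struct scan_result scan_token(const char *s)
{
    static const char longest[] = "foobar";
    struct scan_result r = { 0, 0 };
    size_t pos;

    if (s[0] == '\0')
        return r;                   /* end of input, nothing to match */

    pos = 1;
    r.length = 1;                   /* default rule accepts any one char */
    if (s[0] == 'f') {
        while (pos < sizeof longest - 1 && s[pos] == longest[pos]) {
            pos++;
            if (pos == 3 || pos == 6)   /* "foo" and "foobar" accept */
                r.length = pos;
        }
    }
    r.backed_up = pos - r.length;
    return r;
}
```

On "fo!" the model accepts one character and gives one back (the
State #6 case); on "foob!" and "fooba!" it accepts "foo" and
gives back one and two characters respectively (States #8 and
#9); on "foobar" nothing is given back.
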
The way to remove the backing up is to add "error" rules:

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }


Eliminating backing up among a list of keywords can also be
done using a "catch-all" rule:

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;

This is usually the best solution when appropriate.

Backing up messages tend to cascade. With a complicated set
of rules it's not uncommon to get hundreds of messages. If
one can decipher them, though, it often only takes a dozen
or so rules to eliminate the backing up (though it's easy to
make a mistake and have an error rule accidentally match a
valid token. A possible future flex feature will be to
automatically add rules to eliminate backing up).

It's important to keep in mind that you gain the benefits of
eliminating backing up only if you eliminate every instance
of backing up. Leaving just one means you gain nothing.

Variable trailing context (where both the leading and trail-
ing parts do not have a fixed length) entails almost the
same performance loss as REJECT (i.e., substantial). So
when possible a rule like:

    %%
    mouse|rat/(cat|dog)   run();

is better written:

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

or as

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

Note that here the special '|' action does not provide any
savings, and can even make things worse (see Deficiencies /
Bugs below).

Another area where the user can increase a scanner's perfor-
mance (and one that's easier to implement) arises from the
fact that the longer the tokens matched, the faster the
scanner will run. This is because with long tokens the pro-
cessing of most input characters takes place in the (short)
inner scanning loop, and does not often have to go through
the additional work of setting up the scanning environment
(e.g., yytext) for the action. Recall the scanner for C
comments:

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

This could be sped up by writing it as:

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

Now instead of each newline requiring the processing of
another action, recognizing the newlines is "distributed"
over the other rules to keep the matched text as long as
possible. Note that adding rules does not slow down the
scanner! The speed of the scanner is independent of the
number of rules or (modulo the considerations given at the
beginning of this section) how complicated the rules are
with regard to operators such as '*' and '|'.

A final example in speeding up a scanner: suppose you want
to scan through a file containing identifiers and keywords,
one per line and with no other extraneous characters, and
recognize all the keywords. A natural first approach is:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */

To eliminate the back-tracking, introduce a catch-all rule:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */

Now, if it's guaranteed that there's exactly one word per
line, then we can reduce the total number of matches by a
half by merging in the recognition of newlines with that of
the other tokens:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    .|\n     /* it's not a keyword */

One has to be careful here, as we have now reintroduced
backing up into the scanner. In particular, while we know
that there will never be any characters in the input stream
other than letters or newlines, flex can't figure this out,
and it will plan for possibly needing to back up when it has
scanned a token like "auto" and then the next character is
something other than a newline or a letter. Previously it
would then just match the "auto" rule and be done, but now
it has no "auto" rule, only an "auto\n" rule. To eliminate
the possibility of backing up, we could either duplicate all
rules but without final newlines, or, since we never expect
to encounter such an input and therefore don't care how it's
classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    [a-z]+   |
    .|\n     /* it's not a keyword */

Compiled with -Cf, this is about as fast as one can get a
flex scanner to go for this particular problem.

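The claim that merging the newline into the word rules halves
the number of matches can be checked with a small C model. It
does not run a flex scanner; it merely counts how many rule
matches each rule set above would produce, under the stated
assumption that the input is lowercase words separated by
newlines. Both function names are hypothetical.

```c
#include <stddef.h>

/* Matches made when words and newlines are matched by separate rules. */
size_t matches_with_separate_newline(const char *input)
{
    size_t matches = 0;
    size_t i = 0;

    while (input[i] != '\0') {
        if (input[i] == '\n') {
            i++;                    /* the .|\n rule takes the newline */
        } else {
            while (input[i] != '\0' && input[i] != '\n')
                i++;                /* a keyword or [a-z]+ takes the word */
        }
        matches++;
    }
    return matches;
}

/* Matches made when each word rule also consumes the trailing newline. */
size_t matches_with_merged_newline(const char *input)
{
    size_t matches = 0;
    size_t i = 0;

    while (input[i] != '\0') {
        while (input[i] != '\0' && input[i] != '\n')
            i++;
        if (input[i] == '\n')
            i++;
        matches++;
    }
    return matches;
}
```

For the three-line input "auto\nfoo\nwhile\n" the first function
counts six matches and the second counts three, so each action
invocation (and its yytext setup) now covers a whole line.
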
A final note: flex is slow when matching NUL's, particularly
when a token contains multiple NUL's. It's best to write
rules which match short amounts of text if it's anticipated
that the text will often include NUL's.

Another final note regarding performance: as mentioned above
in the section How the Input is Matched, dynamically resiz-
ing yytext to accommodate huge tokens is a slow process
because it presently requires that the (huge) token be res-
canned from the beginning. Thus if performance is vital,
you should attempt to match "large" quantities of text but
not "huge" quantities, where the cutoff between the two is
at about 8K characters/token.

GENERATING C++ SCANNERS
|
||
|
flex provides two different ways to generate scanners for
|
||
|
use with C++. The first way is to simply compile a scanner
|
||
|
generated by flex using a C++ compiler instead of a C com-
|
||
|
piler. You should not encounter any compilations errors
|
||
|
(please report any you find to the email address given in
|
||
|
the Author section below). You can then use C++ code in
|
||
|
your rule actions instead of C code. Note that the default
|
||
|
input source for your scanner remains yyin, and default
|
||
|
echoing is still done to yyout. Both of these remain FILE *
|
||
|
variables and not C++ streams.
|
||
|
|
||
|
You can also use flex to generate a C++ scanner class, using
|
||
|
the -+ option (or, equivalently, %option c++), which is
|
||
|
automatically specified if the name of the flex executable
|
||
|
ends in a '+', such as flex++. When using this option, flex
|
||
|
defaults to generating the scanner to the file lex.yy.cc
|
||
|
instead of lex.yy.c. The generated scanner includes the
|
||
|
header file FlexLexer.h, which defines the interface to two
|
||
|
C++ classes.
|
||
|
|
||
|
The first class, FlexLexer, provides an abstract base class
|
||
|
defining the general scanner class interface. It provides
|
||
|
the following member functions:
|
||
|
|
||
|
const char* YYText()
|
||
|
returns the text of the most recently matched token,
|
||
|
the equivalent of yytext.
|
||
|
|
||
|
int YYLeng()
|
||
|
returns the length of the most recently matched token,
|
||
|
the equivalent of yyleng.
|
||
|
|
||
|
int lineno() const
|
||
|
returns the current input line number (see %option
|
||
|
yylineno), or 1 if %option yylineno was not used.
|
||
|
|
||
|
void set_debug( int flag )
|
||
|
sets the debugging flag for the scanner, equivalent to
|
||
|
assigning to yy_flex_debug (see the Options section
|
||
|
above). Note that you must build the scanner using
|
||
|
%option debug to include debugging information in it.
|
||
|
|
||
|
int debug() const
|
||
|
returns the current setting of the debugging flag.
|
||
|
|
||
|
Also provided are member functions equivalent to
|
||
|
yy_switch_to_buffer(), yy_create_buffer() (though the first
|
||
|
argument is an istream* object pointer and not a FILE*),
|
||
|
yy_flush_buffer(), yy_delete_buffer(), and yyrestart()
|
||
|
(again, the first argument is an istream* object pointer).
|
||
|
|
||
|
The second class defined in FlexLexer.h is yyFlexLexer,
|
||
|
which is derived from FlexLexer. It defines the following
|
||
|
additional member functions:
|
||
|
|
||
|
yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
|
||
|
constructs a yyFlexLexer object using the given streams
|
||
|
for input and output. If not specified, the streams
|
||
|
default to cin and cout, respectively.
|
||
|
|
||
|
virtual int yylex()
|
||
|
performs the same role as yylex() does for ordinary
|
||
|
flex scanners: it scans the input stream, consuming
|
||
|
tokens, until a rule's action returns a value. If you
|
||
|
derive a subclass S from yyFlexLexer and want to access
|
||
|
the member functions and variables of S inside yylex(),
|
||
|
then you need to use %option yyclass="S" to inform flex
|
||
|
that you will be using that subclass instead of yyFlex-
|
||
|
Lexer. In this case, rather than generating
|
||
|
yyFlexLexer::yylex(), flex generates S::yylex() (and
|
||
|
also generates a dummy yyFlexLexer::yylex() that calls
|
||
|
yyFlexLexer::LexerError() if called).
|
||
|
|
||
|
virtual void switch_streams( istream* new_in = 0, ostream* new_out = 0 )
|
||
|
reassigns yyin to new_in (if non-nil) and yyout to new_out
|
||
|
(ditto), deleting the previous input buffer if yyin is
|
||
|
reassigned.
|
||
|
|
||
|
int yylex( istream* new_in, ostream* new_out = 0 )
|
||
|
first switches the input streams via switch_streams(
|
||
|
new_in, new_out ) and then returns the value of
|
||
|
yylex().
|
||
|
|
||
|
In addition, yyFlexLexer defines the following protected
|
||
|
virtual functions which you can redefine in derived classes
|
||
|
to tailor the scanner:
|
||
|
|
||
|
virtual int LexerInput( char* buf, int max_size )
|
||
|
reads up to max_size characters into buf and returns
|
||
|
the number of characters read. To indicate end-of-
|
||
|
input, return 0 characters. Note that "interactive"
|
||
|
scanners (see the -B and -I flags) define the macro
|
||
|
YY_INTERACTIVE. If you redefine LexerInput() and need
|
||
|
to take different actions depending on whether or not
|
||
|
the scanner might be scanning an interactive input
|
||
|
source, you can test for the presence of this name via
|
||
|
#ifdef.
|
||
|
|
||
|
virtual void LexerOutput( const char* buf, int size )
|
||
|
writes out size characters from the buffer buf, which,
|
||
|
while NUL-terminated, may also contain "internal" NUL's
|
||
|
if the scanner's rules can match text with NUL's in
|
||
|
them.
|
||
|
|
||
|
virtual void LexerError( const char* msg )
|
||
|
reports a fatal error message. The default version of
|
||
|
this function writes the message to the stream cerr and
|
||
|
exits.
|
||
|
|
||
|
Note that a yyFlexLexer object contains its entire scanning
|
||
|
state. Thus you can use such objects to create reentrant
|
||
|
scanners. You can instantiate multiple instances of the
|
||
|
same yyFlexLexer class, and you can also combine multiple
|
||
|
C++ scanner classes together in the same program using the
|
||
|
-P option discussed above.
|
||
|
|
||
|
Finally, note that the %array feature is not available to
|
||
|
C++ scanner classes; you must use %pointer (the default).
|
||
|
|
||
|
Here is an example of a simple C++ scanner:
|
||
|
|
||
|
// An example of using the flex C++ scanner class.
|
||
|
|
||
|
%{
|
||
|
int mylineno = 0;
|
||
|
%}
|
||
|
|
||
|
string \"[^\n"]+\"
|
||
|
|
||
|
ws [ \t]+
|
||
|
|
||
|
alpha [A-Za-z]
|
||
|
dig [0-9]
|
||
|
name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
|
||
|
num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)?
|
||
|
num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
|
||
|
number {num1}|{num2}
|
||
|
|
||
|
%%
|
||
|
|
||
|
{ws} /* skip blanks and tabs */
|
||
|
|
||
|
"/*" {
|
||
|
int c;
|
||
|
|
||
|
while((c = yyinput()) != 0)
|
||
|
{
|
||
|
if(c == '\n')
|
||
|
++mylineno;
|
||
|
|
||
|
else if(c == '*')
|
||
|
{
|
||
|
if((c = yyinput()) == '/')
|
||
|
break;
|
||
|
else
|
||
|
unput(c);
|
||
|
}
|
||
|
}
|
||
|
}
|
||
|
|
||
|
{number} cout << "number " << YYText() << '\n';
|
||
|
|
||
|
\n mylineno++;
|
||
|
|
||
|
{name} cout << "name " << YYText() << '\n';
|
||
|
|
||
|
{string} cout << "string " << YYText() << '\n';
|
||
|
|
||
|
%%
|
||
|
|
||
|
int main( int /* argc */, char** /* argv */ )
|
||
|
{
|
||
|
FlexLexer* lexer = new yyFlexLexer;
|
||
|
while(lexer->yylex() != 0)
|
||
|
;
|
||
|
return 0;
|
||
|
}
|
||
|
If you want to create multiple (different) lexer classes,
|
||
|
you use the -P flag (or the prefix= option) to rename each
|
||
|
yyFlexLexer to some other xxFlexLexer. You then can include
|
||
|
<FlexLexer.h> in your other sources once per lexer class,
|
||
|
first renaming yyFlexLexer as follows:
|
||
|
|
||
|
#undef yyFlexLexer
|
||
|
#define yyFlexLexer xxFlexLexer
|
||
|
#include <FlexLexer.h>
|
||
|
|
||
|
#undef yyFlexLexer
|
||
|
#define yyFlexLexer zzFlexLexer
|
||
|
#include <FlexLexer.h>
|
||
|
|
||
|
if, for example, you used %option prefix="xx" for one of
|
||
|
your scanners and %option prefix="zz" for the other.
|
||
|
|
||
|
IMPORTANT: the present form of the scanning class is experi-
|
||
|
mental and may change considerably between major releases.
|
||
|
|
||
|
INCOMPATIBILITIES WITH LEX AND POSIX
|
||
|
flex is a rewrite of the AT&T Unix lex tool (the two imple-
|
||
|
mentations do not share any code, though), with some exten-
|
||
|
sions and incompatibilities, both of which are of concern to
|
||
|
those who wish to write scanners acceptable to either imple-
|
||
|
mentation. Flex is fully compliant with the POSIX lex
|
||
|
specification, except that when using %pointer (the
|
||
|
default), a call to unput() destroys the contents of yytext,
|
||
|
which is counter to the POSIX specification.
|
||
|
|
||
|
In this section we discuss all of the known areas of incom-
|
||
|
patibility between flex, AT&T lex, and the POSIX specifica-
|
||
|
tion.
|
||
|
|
||
|
flex's -l option turns on maximum compatibility with the
|
||
|
original AT&T lex implementation, at the cost of a major
|
||
|
loss in the generated scanner's performance. We note below
|
||
|
which incompatibilities can be overcome using the -l option.
|
||
|
|
||
|
flex is fully compatible with lex with the following excep-
|
||
|
tions:
|
||
|
|
||
|
- The undocumented lex scanner internal variable yylineno
|
||
|
is not supported unless -l or %option yylineno is used.
|
||
|
|
||
|
yylineno should be maintained on a per-buffer basis,
|
||
|
rather than a per-scanner (single global variable)
|
||
|
basis.
|
||
|
|
||
|
yylineno is not part of the POSIX specification.
|
||
|
|
||
|
- The input() routine is not redefinable, though it may
|
||
|
be called to read characters following whatever has
|
||
|
been matched by a rule. If input() encounters an end-
|
||
|
of-file the normal yywrap() processing is done. A
|
||
|
"real" end-of-file is returned by input() as EOF.
|
||
|
|
||
|
Input is instead controlled by defining the YY_INPUT
|
||
|
macro.
|
||
|
|
||
|
The flex restriction that input() cannot be redefined
|
||
|
is in accordance with the POSIX specification, which
|
||
|
simply does not specify any way of controlling the
|
||
|
scanner's input other than by making an initial assign-
|
||
|
ment to yyin.
|
||
|
|
||
|
- The unput() routine is not redefinable. This restric-
|
||
|
tion is in accordance with POSIX.
|
||
|
|
||
|
- flex scanners are not as reentrant as lex scanners. In
|
||
|
particular, if you have an interactive scanner and an
|
||
|
interrupt handler which long-jumps out of the scanner,
|
||
|
and the scanner is subsequently called again, you may
|
||
|
get the following message:
|
||
|
|
||
|
fatal flex scanner internal error--end of buffer missed
|
||
|
|
||
|
To reenter the scanner, first use
|
||
|
|
||
|
yyrestart( yyin );
|
||
|
|
||
|
Note that this call will throw away any buffered input;
|
||
|
usually this isn't a problem with an interactive
|
||
|
scanner.
|
||
|
|
||
|
Also note that flex C++ scanner classes are reentrant,
|
||
|
so if using C++ is an option for you, you should use
|
||
|
them instead. See "Generating C++ Scanners" above for
|
||
|
details.
|
||
|
|
||
|
- output() is not supported. Output from the ECHO macro
|
||
|
is done to the file-pointer yyout (default stdout).
|
||
|
|
||
|
output() is not part of the POSIX specification.
|
||
|
|
||
|
- lex does not support exclusive start conditions (%x),
|
||
|
though they are in the POSIX specification.
|
||
|
|
||
|
- When definitions are expanded, flex encloses them in
|
||
|
parentheses. With lex, the following:
|
||
|
|
||
|
NAME [A-Z][A-Z0-9]*
|
||
|
%%
|
||
|
foo{NAME}? printf( "Found it\n" );
|
||
|
%%
|
||
|
|
||
|
will not match the string "foo" because when the macro
|
||
|
is expanded the rule is equivalent to "foo[A-Z][A-Z0-
|
||
|
9]*?" and the precedence is such that the '?' is asso-
|
||
|
ciated with "[A-Z0-9]*". With flex, the rule will be
|
||
|
expanded to "foo([A-Z][A-Z0-9]*)?" and so the string
|
||
|
"foo" will match.
|
||
|
|
||
|
Note that if the definition begins with ^ or ends with
|
||
|
$ then it is not expanded with parentheses, to allow
|
||
|
these operators to appear in definitions without losing
|
||
|
their special meanings. But the <s>, /, and <<EOF>>
|
||
|
operators cannot be used in a flex definition.
|
||
|
|
||
|
Using -l results in the lex behavior of no parentheses
|
||
|
around the definition.
|
||
|
|
||
|
The POSIX specification is that the definition be
|
||
|
enclosed in parentheses.
|
||
|
|
||
|
- Some implementations of lex allow a rule's action to
|
||
|
begin on a separate line, if the rule's pattern has
|
||
|
trailing whitespace:
|
||
|
|
||
|
%%
|
||
|
foo|bar<space here>
|
||
|
{ foobar_action(); }
|
||
|
|
||
|
flex does not support this feature.
|
||
|
|
||
|
- The lex %r (generate a Ratfor scanner) option is not
|
||
|
supported. It is not part of the POSIX specification.
|
||
|
|
||
|
- After a call to unput(), yytext is undefined until the
|
||
|
next token is matched, unless the scanner was built
|
||
|
using %array. This is not the case with lex or the
|
||
|
POSIX specification. The -l option does away with this
|
||
|
incompatibility.
|
||
|
|
||
|
- The precedence of the {} (numeric range) operator is
|
||
|
different. lex interprets "abc{1,3}" as "match one,
|
||
|
two, or three occurrences of 'abc'", whereas flex
|
||
|
interprets it as "match 'ab' followed by one, two, or
|
||
|
three occurrences of 'c'". The latter is in agreement
|
||
|
with the POSIX specification.
|
||
|
|
||
|
- The precedence of the ^ operator is different. lex
|
||
|
interprets "^foo|bar" as "match either 'foo' at the
|
||
|
beginning of a line, or 'bar' anywhere", whereas flex
|
||
|
interprets it as "match either 'foo' or 'bar' if they
|
||
|
come at the beginning of a line". The latter is in
|
||
|
agreement with the POSIX specification.
|
||
|
|
||
|
- The special table-size declarations such as %a sup-
|
||
|
ported by lex are not required by flex scanners; flex
|
||
|
ignores them.
|
||
|
|
||
|
- The name FLEX_SCANNER is #define'd so scanners may be
|
||
|
written for use with either flex or lex. Scanners also
|
||
|
include YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION
|
||
|
indicating which version of flex generated the scanner
|
||
|
(for example, for the 2.5 release, these defines would
|
||
|
be 2 and 5 respectively).
|
||
|
|
||
|
The following flex features are not included in lex or the
|
||
|
POSIX specification:
|
||
|
|
||
|
C++ scanners
|
||
|
%option
|
||
|
start condition scopes
|
||
|
start condition stacks
|
||
|
interactive/non-interactive scanners
|
||
|
yy_scan_string() and friends
|
||
|
yyterminate()
|
||
|
yy_set_interactive()
|
||
|
yy_set_bol()
|
||
|
YY_AT_BOL()
|
||
|
<<EOF>>
|
||
|
<*>
|
||
|
YY_DECL
|
||
|
YY_START
|
||
|
YY_USER_ACTION
|
||
|
YY_USER_INIT
|
||
|
#line directives
|
||
|
%{}'s around actions
|
||
|
multiple actions on a line
|
||
|
|
||
|
plus almost all of the flex flags. The last feature in the
|
||
|
list refers to the fact that with flex you can put multiple
|
||
|
actions on the same line, separated with semi-colons, while
|
||
|
with lex, the following
|
||
|
|
||
|
foo handle_foo(); ++num_foos_seen;
|
||
|
|
||
|
is (rather surprisingly) truncated to
|
||
|
|
||
|
foo handle_foo();
|
||
|
|
||
|
flex does not truncate the action. Actions that are not
|
||
|
enclosed in braces are simply terminated at the end of the
|
||
|
line.
|
||
|
|
||
|
DIAGNOSTICS
|
||
|
warning, rule cannot be matched indicates that the given
|
||
|
rule cannot be matched because it follows other rules that
|
||
|
will always match the same text as it. For example, in the
|
||
|
following "foo" cannot be matched because it comes after an
|
||
|
identifier "catch-all" rule:
|
||
|
|
||
|
[a-z]+ got_identifier();
|
||
|
foo got_foo();
|
||
|
|
||
|
Using REJECT in a scanner suppresses this warning.
|
||
|
|
||
|
warning, -s option given but default rule can be matched
|
||
|
means that it is possible (perhaps only in a particular
|
||
|
start condition) that the default rule (match any single
|
||
|
character) is the only one that will match a particular
|
||
|
input. Since -s was given, presumably this is not intended.
|
||
|
|
||
|
reject_used_but_not_detected undefined or
|
||
|
yymore_used_but_not_detected undefined - These errors can
|
||
|
occur at compile time. They indicate that the scanner uses
|
||
|
REJECT or yymore() but that flex failed to notice the fact,
|
||
|
meaning that flex scanned the first two sections looking for
|
||
|
occurrences of these actions and failed to find any, but
|
||
|
somehow you snuck some in (via a #include file, for exam-
|
||
|
ple). Use %option reject or %option yymore to indicate to
|
||
|
flex that you really do use these features.
|
||
|
|
||
|
flex scanner jammed - a scanner compiled with -s has encoun-
|
||
|
tered an input string which wasn't matched by any of its
|
||
|
rules. This error can also occur due to internal problems.
|
||
|
|
||
|
token too large, exceeds YYLMAX - your scanner uses %array
|
||
|
and one of its rules matched a string longer than the YYLMAX
|
||
|
constant (8K bytes by default). You can increase the value
|
||
|
by #define'ing YYLMAX in the definitions section of your
|
||
|
flex input.
|
||
|
|
||
|
scanner requires -8 flag to use the character 'x' - Your
|
||
|
scanner specification includes recognizing the 8-bit charac-
|
||
|
ter 'x' and you did not specify the -8 flag, and your
|
||
|
scanner defaulted to 7-bit because you used the -Cf or -CF
|
||
|
table compression options. See the discussion of the -7
|
||
|
flag for details.
|
||
|
|
||
|
flex scanner push-back overflow - you used unput() to push
|
||
|
back so much text that the scanner's buffer could not hold
|
||
|
both the pushed-back text and the current token in yytext.
|
||
|
Ideally the scanner should dynamically resize the buffer in
|
||
|
this case, but at present it does not.
|
||
|
|
||
|
input buffer overflow, can't enlarge buffer because scanner
|
||
|
uses REJECT - the scanner was working on matching an
|
||
|
extremely large token and needed to expand the input buffer.
|
||
|
This doesn't work with scanners that use REJECT.
|
||
|
|
||
|
fatal flex scanner internal error--end of buffer missed -
|
||
|
This can occur in a scanner which is reentered after a
|
||
|
long-jump has jumped out (or over) the scanner's activation
|
||
|
frame. Before reentering the scanner, use:
|
||
|
|
||
|
yyrestart( yyin );
|
||
|
|
||
|
or, as noted above, switch to using the C++ scanner class.
|
||
|
|
||
|
too many start conditions in <> - you listed more start
|
||
|
conditions in a <> construct than exist (so you must have listed
|
||
|
at least one of them twice).
|
||
|
|
||
|
FILES
|
||
|
-lfl library with which scanners must be linked.
|
||
|
|
||
|
lex.yy.c
|
||
|
generated scanner (called lexyy.c on some systems).
|
||
|
|
||
|
lex.yy.cc
|
||
|
generated C++ scanner class, when using -+.
|
||
|
|
||
|
<FlexLexer.h>
|
||
|
header file defining the C++ scanner base class, Flex-
|
||
|
Lexer, and its derived class, yyFlexLexer.
|
||
|
|
||
|
flex.skl
|
||
|
skeleton scanner. This file is only used when building
|
||
|
flex, not when flex executes.
|
||
|
|
||
|
lex.backup
|
||
|
backing-up information for -b flag (called lex.bck on
|
||
|
some systems).
|
||
|
|
||
|
DEFICIENCIES / BUGS
|
||
|
Some trailing context patterns cannot be properly matched
|
||
|
and generate warning messages ("dangerous trailing con-
|
||
|
text"). These are patterns where the ending of the first
|
||
|
part of the rule matches the beginning of the second part,
|
||
|
such as "zx*/xy*", where the 'x*' matches the 'x' at the
|
||
|
beginning of the trailing context. (Note that the POSIX
|
||
|
draft states that the text matched by such patterns is unde-
|
||
|
fined.)
|
||
|
|
||
|
For some trailing context rules, parts which are actually
|
||
|
fixed-length are not recognized as such, leading to the
|
||
|
abovementioned performance loss. In particular, parts using
|
||
|
'|' or {n} (such as "foo{3}") are always considered
|
||
|
variable-length.
|
||
|
|
||
|
Combining trailing context with the special '|' action can
|
||
|
result in fixed trailing context being turned into the more
|
||
|
expensive variable trailing context. For example, in the
|
||
|
following:
|
||
|
|
||
|
%%
|
||
|
abc |
|
||
|
xyz/def
|
||
|
|
||
|
|
||
|
Use of unput() invalidates yytext and yyleng, unless the
|
||
|
%array directive or the -l option has been used.
|
||
|
|
||
|
Pattern-matching of NUL's is substantially slower than
|
||
|
matching other characters.
|
||
|
|
||
|
Dynamic resizing of the input buffer is slow, as it entails
|
||
|
rescanning all the text matched so far by the current (gen-
|
||
|
erally huge) token.
|
||
|
|
||
|
Due to both buffering of input and read-ahead, you cannot
|
||
|
intermix calls to <stdio.h> routines, such as, for example,
|
||
|
getchar(), with flex rules and expect it to work. Call
|
||
|
input() instead.
|
||
|
|
||
|
The total number of table entries listed by the -v flag excludes the
|
||
|
number of table entries needed to determine what rule has
|
||
|
been matched. The number of entries is equal to the number
|
||
|
of DFA states if the scanner does not use REJECT, and some-
|
||
|
what greater than the number of states if it does.
|
||
|
|
||
|
REJECT cannot be used with the -f or -F options.
|
||
|
|
||
|
The flex internal algorithms need documentation.
|
||
|
|
||
|
SEE ALSO
|
||
|
lex(1), yacc(1), sed(1), awk(1).
|
||
|
|
||
|
John Levine, Tony Mason, and Doug Brown, Lex & Yacc,
|
||
|
O'Reilly and Associates. Be sure to get the 2nd edition.
|
||
|
|
||
|
M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator
|
||
|
|
||
|
Alfred Aho, Ravi Sethi and Jeffrey Ullman, Compilers: Prin-
|
||
|
ciples, Techniques and Tools, Addison-Wesley (1986).
|
||
|
Describes the pattern-matching techniques used by flex
|
||
|
(deterministic finite automata).
|
||
|
|
||
|
AUTHOR
|
||
|
Vern Paxson, with the help of many ideas and much inspira-
|
||
|
tion from Van Jacobson. Original version by Jef Poskanzer.
|
||
|
The fast table representation is a partial implementation of
|
||
|
a design done by Van Jacobson. The implementation was done
|
||
|
by Kevin Gong and Vern Paxson.
|
||
|
|
||
|
Thanks to the many flex beta-testers, feedbackers, and con-
|
||
|
tributors, especially Francois Pinard, Casey Leedom, Robert
|
||
|
Abramovitz, Stan Adermann, Terry Allen, David Barker-
|
||
|
Plummer, John Basrai, Neal Becker, Nelson H.F. Beebe,
|
||
|
benson@odi.com, Karl Berry, Peter A. Bigot, Simon Blanchard,
|
||
|
Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick
|
||
|
Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin,
|
||
|
Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels,
|
||
|
Chris G. Demetriou, Theo Deraadt, Mike Donahue, Chuck
|
||
|
Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
|
||
|
Flatters, Jon Forrest, Jeffrey Friedl, Joe Gayda, Kaveh R.
|
||
|
Ghazi, Wolfgang Glunz, Eric Goldman, Christopher M. Gould,
|
||
|
Ulrich Grepel, Peer Griebel, Jan Hajic, Charles Hemphill,
|
||
|
NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
|
||
|
Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs,
|
||
|
Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry
|
||
|
Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
|
||
|
Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Steve Kirsch,
|
||
|
Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee,
|
||
|
Rohan Lenard, Craig Leres, John Levine, Steve Liddle, David
|
||
|
Loffredo, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
|
||
|
Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn,
|
||
|
Jim Meyering, R. Alexander Milowski, Erik Naggum, G.T.
|
||
|
Nicol, Landon Noll, James Nordby, Marc Nozell, Richard
|
||
|
Ohnemus, Karsten Pahnke, Sven Panne, Roland Pesch, Walter
|
||
|
Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe
|
||
|
Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin, Rick
|
||
|
Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind,
|
||
|
Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf
|
||
|
Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas
|
||
|
Schwab, Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-
|
||
|
Erik Strömquist, Mike Stump, Paul Stuart, Dave Tallman, Ian
|
||
|
Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi Tsai,
|
||
|
Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms,
|
||
|
Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle, David
|
||
|
Zuhn, and those whose names have slipped my marginal mail-
|
||
|
archiving skills but whose contributions are appreciated all
|
||
|
the same.
|
||
|
|
||
|
Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John
|
||
|
Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol,
|
||
|
Francois Pinard, Rich Salz, and Richard Stallman for help
|
||
|
with various distribution headaches.
|
||
|
|
||
|
Thanks to Esmond Pitt and Earle Horton for 8-bit character
|
||
|
support; to Benson Margulies and Fred Burke for C++ support;
|
||
|
to Kent Williams and Tom Epperly for C++ class support; to
|
||
|
Ove Ewerlid for support of NUL's; and to Eric Hughes for
|
||
|
support of multiple buffers.
|
||
|
|
||
|
This work was primarily done when I was with the Real Time
|
||
|
Systems Group at the Lawrence Berkeley Laboratory in Berke-
|
||
|
ley, CA. Many thanks to all there for the support I
|
||
|
received.
|
||
|
|
||
|
Send comments to vern@ee.lbl.gov.
|