This is Info file flex.info, produced by Makeinfo-1.55 from the input
file flex.texi.

START-INFO-DIR-ENTRY
* Flex: (flex).         A fast scanner generator.
END-INFO-DIR-ENTRY

This file documents Flex.

Copyright (c) 1990 The Regents of the University of California. All
rights reserved.

This code is derived from software contributed to Berkeley by Vern
Paxson.

The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

Redistribution and use in source and binary forms with or without
modification are permitted provided that: (1) source distributions
retain this entire copyright notice and comment, and (2) distributions
including binaries display the following acknowledgement: "This
product includes software developed by the University of California,
Berkeley and its contributors" in the documentation or other materials
provided with the distribution and in all advertising materials
mentioning features or use of this software. Neither the name of the
University nor the names of its contributors may be used to endorse or
promote products derived from this software without specific prior
written permission.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.


File: flex.info, Node: Top, Next: Name, Prev: (dir), Up: (dir)

flex
****

This manual documents `flex'. It covers release 2.5.

* Menu:

* Name::                        Name
* Synopsis::                    Synopsis
* Overview::                    Overview
* Description::                 Description
* Examples::                    Some simple examples
* Format::                      Format of the input file
* Patterns::                    Patterns
* Matching::                    How the input is matched
* Actions::                     Actions
* Generated scanner::           The generated scanner
* Start conditions::            Start conditions
* Multiple buffers::            Multiple input buffers
* End-of-file rules::           End-of-file rules
* Miscellaneous::               Miscellaneous macros
* User variables::              Values available to the user
* YACC interface::              Interfacing with `yacc'
* Options::                     Options
* Performance::                 Performance considerations
* C++::                         Generating C++ scanners
* Incompatibilities::           Incompatibilities with `lex' and POSIX
* Diagnostics::                 Diagnostics
* Files::                       Files
* Deficiencies::                Deficiencies / Bugs
* See also::                    See also
* Author::                      Author


File: flex.info, Node: Name, Next: Synopsis, Prev: Top, Up: Top

Name
====

flex - fast lexical analyzer generator


File: flex.info, Node: Synopsis, Next: Overview, Prev: Name, Up: Top

Synopsis
========

     flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton]
     [--help --version] [FILENAME ...]


File: flex.info, Node: Overview, Next: Description, Prev: Synopsis, Up: Top

Overview
========

This manual describes `flex', a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial
and reference sections:

Description
     a brief overview of the tool

Some Simple Examples
Format Of The Input File
Patterns
     the extended regular expressions used by flex

How The Input Is Matched
     the rules for determining what has been matched

Actions
     how to specify what to do when a pattern is matched

The Generated Scanner
     details regarding the scanner that flex produces; how to control
     the input source

Start Conditions
     introducing context into your scanners, and managing
     "mini-scanners"

Multiple Input Buffers
     how to manipulate multiple input sources; how to scan from strings
     instead of files

End-of-file Rules
     special rules for matching the end of the input

Miscellaneous Macros
     a summary of macros available to the actions

Values Available To The User
     a summary of values available to the actions

Interfacing With Yacc
     connecting flex scanners together with yacc parsers

Options
     flex command-line options, and the "%option" directive

Performance Considerations
     how to make your scanner go as fast as possible

Generating C++ Scanners
     the (experimental) facility for generating C++ scanner classes

Incompatibilities With Lex And POSIX
     how flex differs from AT&T lex and the POSIX lex standard

Diagnostics
     those error messages produced by flex (or scanners it generates)
     whose meanings might not be apparent

Files
     files used by flex

Deficiencies / Bugs
     known problems with flex

See Also
     other documentation, related tools

Author
     includes contact information


File: flex.info, Node: Description, Next: Examples, Prev: Overview, Up: Top

Description
===========

`flex' is a tool for generating "scanners": programs which
recognize lexical patterns in text. `flex' reads the given input
files, or its standard input if no file names are given, for a
description of a scanner to generate. The description is in the form
of pairs of regular expressions and C code, called "rules". `flex'
generates as output a C source file, `lex.yy.c', which defines a
routine `yylex()'. This file is compiled and linked with the `-lfl'
library to produce an executable. When the executable is run, it
analyzes its input for occurrences of the regular expressions.
Whenever it finds one, it executes the corresponding C code.


File: flex.info, Node: Examples, Next: Format, Prev: Description, Up: Top

Some simple examples
====================

First some simple examples to get the flavor of how one uses `flex'.
The following `flex' input specifies a scanner which, whenever it
encounters the string "username", will replace it with the user's login
name:

     %%
     username    printf( "%s", getlogin() );

By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of "username" expanded. In this
input, there is just one rule. "username" is the PATTERN and the
"printf" is the ACTION. The "%%" marks the beginning of the rules.

Here's another simple example:

             int num_lines = 0, num_chars = 0;

     %%
     \n      ++num_lines; ++num_chars;
     .       ++num_chars;

     %%
     int main( void )
             {
             yylex();
             printf( "# of lines = %d, # of chars = %d\n",
                     num_lines, num_chars );
             return 0;
             }

This scanner counts the number of characters and the number of lines
in its input (it produces no output other than the final report on the
counts). The first line declares two globals, "num_lines" and
"num_chars", which are accessible both inside `yylex()' and in the
`main()' routine declared after the second "%%". There are two rules,
one which matches a newline ("\n") and increments both the line count
and the character count, and one which matches any character other than
a newline (indicated by the "." regular expression).

A somewhat more complicated example:

     /* scanner for a toy Pascal-like language */

     %{
     /* need this for the calls to atoi() and atof() below */
     #include <stdlib.h>
     %}

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

     %%

     {DIGIT}+    {
                 printf( "An integer: %s (%d)\n", yytext,
                         atoi( yytext ) );
                 }

     {DIGIT}+"."{DIGIT}*        {
                 printf( "A float: %s (%g)\n", yytext,
                         atof( yytext ) );
                 }

     if|then|begin|end|procedure|function        {
                 printf( "A keyword: %s\n", yytext );
                 }

     {ID}        printf( "An identifier: %s\n", yytext );

     "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

     "{"[^}\n]*"}"     /* eat up one-line comments */

     [ \t\n]+          /* eat up whitespace */

     .           printf( "Unrecognized character: %s\n", yytext );

     %%

     int main( int argc, char **argv )
         {
         ++argv, --argc;  /* skip over program name */
         if ( argc > 0 )
                 yyin = fopen( argv[0], "r" );
         else
                 yyin = stdin;

         yylex();
         return 0;
         }

This is the beginning of a simple scanner for a language like
Pascal. It identifies different types of TOKENS and reports on what it
has seen.

The details of this example will be explained in the following
sections.


File: flex.info, Node: Format, Next: Patterns, Prev: Examples, Up: Top

Format of the input file
========================

The `flex' input file consists of three sections, separated by a
line with just `%%' in it:

     definitions
     %%
     rules
     %%
     user code

The "definitions" section contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section. Name
definitions have the form:

     name definition

The "name" is a word beginning with a letter or an underscore ('_')
followed by zero or more letters, digits, '_', or '-' (dash). The
definition is taken to begin at the first non-white-space character
following the name and to continue to the end of the line. The
definition can subsequently be referred to using "{name}", which will
expand to "(definition)". For example,

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

defines "DIGIT" to be a regular expression which matches a single
digit, and "ID" to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to

     {DIGIT}+"."{DIGIT}*

is identical to

     ([0-9])+"."([0-9])*

and matches one-or-more digits followed by a '.' followed by
zero-or-more digits.

The RULES section of the `flex' input contains a series of rules of
the form:

     pattern action

where the pattern must be unindented and the action must begin on the
same line.

See below for a further description of patterns and actions.

Finally, the user code section is simply copied to `lex.yy.c'
verbatim. It is used for companion routines which call or are called
by the scanner. The presence of this section is optional; if it is
missing, the second `%%' in the input file may be skipped, too.

In the definitions and rules sections, any *indented* text or text
enclosed in `%{' and `%}' is copied verbatim to the output (with the
`%{' and `%}' removed). The `%{' and `%}' must each appear unindented
on a line by itself.

In the rules section, any indented or %{} text appearing before the
first rule may be used to declare variables which are local to the
scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%{} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for `POSIX' compliance; see below for other
such features).

In the definitions section (but not in the rules section), an
unindented comment (i.e., a line beginning with "/*") is also copied
verbatim to the output up to the next "*/".


File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top

Patterns
========

The patterns in the input are written using an extended set of
regular expressions. These are:

`x'
     match the character `x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     `x', a `y', or a `z'

`[abj-oZ]'
     a "character class" with a range in it; matches an `a', a `b', any
     letter from `j' through `o', or a `Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class. In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`R*'
     zero or more R's, where R is any regular expression

`R+'
     one or more R's

`R?'
     zero or one R's (that is, "an optional R")

`R{2,5}'
     anywhere from two to five R's

`R{2,}'
     two or more R's

`R{4}'
     exactly 4 R's

`{NAME}'
     the expansion of the "NAME" definition (see above)

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is an `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of \X. Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value `2a'

`(R)'
     match an R; parentheses are used to override precedence (see below)

`RS'
     the regular expression R followed by the regular expression S;
     called "concatenation"

`R|S'
     either an R or an S

`R/S'
     an R but only if it is followed by an S. The text matched by S is
     included when determining whether this rule is the "longest
     match", but is then returned to the input before the action is
     executed. So the action only sees the text matched by R. This
     type of pattern is called "trailing context". (There are some
     combinations of `R/S' that `flex' cannot match correctly; see
     notes in the Deficiencies / Bugs section below regarding
     "dangerous trailing context".)

`^R'
     an R, but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`R$'
     an R, but only at the end of a line (i.e., just before a newline).
     Equivalent to "R/\n".

     Note that flex's notion of "newline" is exactly whatever the C
     compiler used to compile flex interprets '\n' as; in particular,
     on some DOS systems you must either filter out \r's in the input
     yourself, or explicitly use R/\r\n for "R$".

`<S>R'
     an R, but only in start condition S (see below for discussion of
     start conditions)

`<S1,S2,S3>R'
     same, but in any of start conditions S1, S2, or S3

`<*>R'
     an R in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file

`<S1,S2><<EOF>>'
     an end-of-file when in start condition S1 or S2
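
The escape entries in the table above (`\X', `\0', `\123', `\x2a') can
be made concrete with a short C function that maps the text following
the backslash to a character value. This is an illustrative sketch
only: `decode_escape' is a made-up name, it is not how `flex' parses
patterns, and it does no validation beyond what `strtol()' provides
(for instance, it does not reject malformed digits).

```c
#include <assert.h>
#include <stdlib.h>

/* `s' is the text after the backslash, e.g. "n", "0", "123", "x2a"
   or "*". Returns the character value the escape stands for. */
static int decode_escape(const char *s)
{
    assert(s != NULL && s[0] != '\0');
    switch (s[0]) {
    case 'a': return '\a';
    case 'b': return '\b';
    case 'f': return '\f';
    case 'n': return '\n';
    case 'r': return '\r';
    case 't': return '\t';
    case 'v': return '\v';
    case 'x': return (int)strtol(s + 1, NULL, 16);  /* \x2a */
    default:
        if (s[0] >= '0' && s[0] <= '7')
            return (int)strtol(s, NULL, 8);         /* \0, \123 */
        return (unsigned char)s[0];                 /* literal X */
    }
}
```

With this sketch, "123" decodes to 83 (the letter `S' in ASCII),
"x2a" decodes to 42 (the character `*'), and "*" decodes to a literal
`*'.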

Note that inside of a character class, all regular expression
operators lose their special meaning except escape ('\') and the
character class operators, '-', ']', and, at the beginning of the
class, '^'.

The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,

     foo|bar*

is the same as

     (foo)|(ba(r*))

since the '*' operator has higher precedence than concatenation, and
concatenation higher than alternation ('|'). This pattern therefore
matches *either* the string "foo" *or* the string "ba" followed by
zero-or-more r's. To match "foo" or zero-or-more "bar"'s, use:

     foo|(bar)*

and to match zero-or-more "foo"'s-or-"bar"'s:

     (foo|bar)*

In addition to characters and ranges of characters, character
classes can also contain character class "expressions". These are
expressions enclosed inside `[:' and `:]' delimiters (which themselves
must appear between the '[' and ']' of the character class; other
elements may occur inside the character class, too). The valid
expressions are:

     [:alnum:] [:alpha:] [:blank:]
     [:cntrl:] [:digit:] [:graph:]
     [:lower:] [:print:] [:punct:]
     [:space:] [:upper:] [:xdigit:]

These expressions all designate a set of characters equivalent to
the corresponding standard C `isXXX' function. For example,
`[:alnum:]' designates those characters for which `isalnum()' returns
true, i.e., any alphabetic or numeric character. Some systems don't
provide `isblank()', so flex defines `[:blank:]' as a blank or a tab.

For example, the following character classes are all equivalent:

     [[:alnum:]]
     [[:alpha:][:digit:]]
     [[:alpha:]0-9]
     [a-zA-Z0-9]
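
The equivalence of the four character classes above can be checked
against the C library. The sketch below assumes the default "C" locale
and an ASCII character set, and looks only at 7-bit characters; in
other locales `isalpha()' may accept additional characters, in which
case the classes need not coincide. The function names are invented
for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* [a-zA-Z0-9] spelled out with explicit ranges. */
static bool in_explicit_ranges(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Return true if isalnum(), isalpha()||isdigit(), and the explicit
   ranges agree for every 7-bit character. */
static bool classes_agree(void)
{
    assert('a' == 97);  /* the explicit ranges assume ASCII */
    for (int c = 0; c < 128; ++c) {
        bool alnum = isalnum(c) != 0;
        bool alpha_or_digit = isalpha(c) != 0 || isdigit(c) != 0;

        if (alnum != alpha_or_digit || alnum != in_explicit_ranges(c))
            return false;
    }
    return true;
}
```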

If your scanner is case-insensitive (the `-i' flag), then
`[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

Some notes on patterns:

  - A negated character class such as the example "[^A-Z]" above *will
    match a newline* unless "\n" (or an equivalent escape sequence) is
    one of the characters explicitly present in the negated character
    class (e.g., "[^A-Z\n]"). This is unlike how many other regular
    expression tools treat negated character classes, but
    unfortunately the inconsistency is historically entrenched.
    Matching newlines means that a pattern like [^"]* can match the
    entire input unless there's another quote in the input.

  - A rule can have at most one instance of trailing context (the '/'
    operator or the '$' operator). The start condition, '^', and
    "<<EOF>>" patterns can only occur at the beginning of a pattern,
    and, like '/' and '$', cannot be grouped inside parentheses. A '^'
    which does not occur at the beginning of a rule or a '$' which
    does not occur at the end of a rule loses its special properties
    and is treated as a normal character.

    The following are illegal:

         foo/bar$
         <sc1>foo<sc2>bar

    Note that the first of these can be written "foo/bar\n".

    The following will result in '$' or '^' being treated as a normal
    character:

         foo|(bar$)
         foo|^bar

    If what's wanted is a "foo" or a bar-followed-by-a-newline, the
    following could be used (the special '|' action is explained
    below):

         foo      |
         bar$     /* action goes here */

    A similar trick will work for matching a foo or a
    bar-at-the-beginning-of-a-line.


File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top

How the input is matched
========================

When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches
of the same length, the rule listed first in the `flex' input file is
chosen.
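
The selection rule just described (longest match wins, and the
earliest rule wins a tie) can be sketched as a small C function. This
only illustrates the decision; a real `flex' scanner makes it inside a
table-driven automaton rather than by comparing per-rule lengths. The
name `choose_rule' and the convention that a length of 0 means "this
rule did not match" are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* match_len[i] is the number of characters rule i would match at the
   current input position (0 meaning "no match"; for trailing context
   rules the trailing part is included). Returns the index of the
   rule to apply, or -1 if no rule matched. */
static int choose_rule(const size_t *match_len, int num_rules)
{
    assert(match_len != NULL && num_rules >= 0);
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < num_rules; ++i) {
        /* strictly greater: on a tie the earlier rule is kept */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

In the toy Pascal scanner shown earlier, the input "if" is matched
with length 2 by both the keyword rule and the `{ID}' rule, so the
keyword rule, listed first, is chosen; the input "ifx" is matched
with length 3 by `{ID}' alone, so it is reported as an identifier.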

Once the match is determined, the text corresponding to the match
(called the TOKEN) is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'. The ACTION
corresponding to the matched pattern is then executed (a more detailed
description of actions follows), and then the remaining input is
scanned for another match.

If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest legal `flex' input is:

     %%

which generates a scanner that simply copies its input (one
character at a time) to its output.

Note that `yytext' can be defined in two different ways: either as a
character *pointer* or as a character *array*. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input. The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array. The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory). The disadvantage is that you are restricted in how
your actions can modify `yytext' (see the next section), and calls to
the `unput()' function destroy the present contents of `yytext', which
can be a considerable porting headache when moving between different
`lex' versions.

The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(see below). Furthermore, existing `lex' programs sometimes access
`yytext' externally using declarations of the form:

     extern char yytext[];

This definition is erroneous when used with `%pointer', but correct
for `%array'.

`%array' defines `yytext' to be an array of `YYLMAX' characters,
which defaults to a fairly large value. You can change the size by
simply #define'ing `YYLMAX' to a different value in the first section
of your `flex' input. As mentioned above, with `%pointer' `yytext'
grows dynamically to accommodate large tokens. While this means your
`%pointer' scanner can accommodate very large tokens (such as matching
entire blocks of comments), bear in mind that each time the scanner
must resize `yytext' it also must rescan the entire token from the
beginning, so matching such tokens can prove slow. `yytext' presently
does *not* dynamically grow if a call to `unput()' results in too much
text being pushed back; instead, a run-time error results.

Also note that you cannot use `%array' with C++ scanner classes (the
`c++' option; see below).


File: flex.info, Node: Actions, Next: Generated scanner, Prev: Matching, Up: Top

Actions
=======

Each pattern in a rule has a corresponding action, which can be any
arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of "zap me" from its input:

     %%
     "zap me"

(It will copy all other characters in the input to the output since
they will be matched by the default rule.)

Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:

     %%
     [ \t]+        putchar( ' ' );
     [ \t]+$       /* ignore this token */

If the action contains a '{', then the action spans till the
balancing '}' is found, and the action may cross multiple lines.
`flex' knows about C strings and comments and won't be fooled by braces
found within them, but also allows actions to begin with `%{' and will
consider the action to be all the text up to the next `%}' (regardless
of ordinary braces inside the action).

An action consisting solely of a vertical bar ('|') means "same as
the action for the next rule." See below for an illustration.

Actions can include arbitrary C code, including `return' statements
to return a value to whatever routine called `yylex()'. Each time
`yylex()' is called it continues processing tokens from where it last
left off until it either reaches the end of the file or executes a
return.

Actions are free to modify `yytext' except for lengthening it
(adding characters to its end, which would overwrite later characters
in the input stream). This however does not apply when using `%array'
(see above); in that case, `yytext' may be freely modified in any way.

Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).

There are a number of special directives which can be included
within an action:

  - `ECHO' copies yytext to the scanner's output.

  - `BEGIN' followed by the name of a start condition places the
    scanner in the corresponding start condition (see below).

  - `REJECT' directs the scanner to proceed on to the "second best"
    rule which matched the input (or a prefix of the input). The rule
    is chosen as described above in "How the Input is Matched", and
    `yytext' and `yyleng' set up appropriately. It may either be one
    which matched as much text as the originally chosen rule but came
    later in the `flex' input file, or one which matched less text.
    For example, the following will both count the words in the input
    and call the routine special() whenever "frob" is seen:

                 int word_count = 0;
         %%

         frob        special(); REJECT;
         [^ \t\n]+   ++word_count;

    Without the `REJECT', any "frob"'s in the input would not be
    counted as words, since the scanner normally executes only one
    action per token. Multiple `REJECT's are allowed, each one
    finding the next best choice to the currently active rule. For
    example, when the following scanner scans the token "abcd", it
    will write "abcdabcaba" to the output:

         %%
         a        |
         ab       |
         abc      |
         abcd     ECHO; REJECT;
         .|\n     /* eat up any unmatched character */

    (The first three rules share the fourth's action since they use
    the special '|' action.) `REJECT' is a particularly expensive
    feature in terms of scanner performance; if it is used in *any* of
    the scanner's actions it will slow down *all* of the scanner's
    matching. Furthermore, `REJECT' cannot be used with the `-Cf' or
    `-CF' options (see below).

    Note also that unlike the other special actions, `REJECT' is a
    *branch*; code immediately following it in the action will *not*
    be executed.

  - `yymore()' tells the scanner that the next time it matches a rule,
    the corresponding token should be *appended* onto the current
    value of `yytext' rather than replacing it. For example, given
    the input "mega-kludge" the following will write
    "mega-mega-kludge" to the output:

         %%
         mega-    ECHO; yymore();
         kludge   ECHO;

    First "mega-" is matched and echoed to the output. Then "kludge"
    is matched, but the previous "mega-" is still hanging around at
    the beginning of `yytext' so the `ECHO' for the "kludge" rule will
    actually write "mega-kludge".

    Two notes regarding use of `yymore()'. First, `yymore()' depends on
    the value of `yyleng' correctly reflecting the size of the current
    token, so you must not modify `yyleng' if you are using `yymore()'.
    Second, the presence of `yymore()' in the scanner's action entails a
    minor performance penalty in the scanner's matching speed.

  - `yyless(n)' returns all but the first N characters of the current
    token back to the input stream, where they will be rescanned when
    the scanner looks for the next match. `yytext' and `yyleng' are
    adjusted appropriately (e.g., `yyleng' will now be equal to N).
    For example, on the input "foobar" the following will write out
    "foobarbar":

         %%
         foobar    ECHO; yyless(3);
         [a-z]+    ECHO;

    An argument of 0 to `yyless' will cause the entire current input
    string to be scanned again. Unless you've changed how the scanner
    will subsequently process its input (using `BEGIN', for example),
    this will result in an endless loop.

    Note that `yyless' is a macro and can only be used in the flex
    input file, not from other source files.

- `unput(c)' puts the character `c' back onto the input stream. It
|
|||
|
will be the next character scanned. The following action will
|
|||
|
take the current token and cause it to be rescanned enclosed in
|
|||
|
parentheses.
|
|||
|
|
|||
|
{
|
|||
|
int i;
|
|||
|
/* Copy yytext because unput() trashes yytext */
|
|||
|
char *yycopy = strdup( yytext );
|
|||
|
unput( ')' );
|
|||
|
for ( i = yyleng - 1; i >= 0; --i )
|
|||
|
unput( yycopy[i] );
|
|||
|
unput( '(' );
|
|||
|
free( yycopy );
|
|||
|
}
|
|||
|
|
|||
|
Note that since each `unput()' puts the given character back at
|
|||
|
the *beginning* of the input stream, pushing back strings must be
|
|||
|
done back-to-front. An important potential problem when using
|
|||
|
`unput()' is that if you are using `%pointer' (the default), a
|
|||
|
call to `unput()' *destroys* the contents of `yytext', starting
|
|||
|
with its rightmost character and devouring one character to the
|
|||
|
left with each call. If you need the value of yytext preserved
|
|||
|
after a call to `unput()' (as in the above example), you must
|
|||
|
either first copy it elsewhere, or build your scanner using
|
|||
|
`%array' instead (see How The Input Is Matched).
|
|||
|
|
|||
|
Finally, note that you cannot put back `EOF' to attempt to mark
|
|||
|
the input stream with an end-of-file.
|
|||
|
|
|||
|
- `input()' reads the next character from the input stream. For
|
|||
|
example, the following is one way to eat up C comments:
|
|||
|
|
|||
|
     %%
     "/*"        {
                 register int c;

                 for ( ; ; )
                     {
                     while ( (c = input()) != '*' &&
                             c != EOF )
                         ;    /* eat up text of comment */

                     if ( c == '*' )
                         {
                         while ( (c = input()) == '*' )
                             ;
                         if ( c == '/' )
                             break;    /* found the end */
                         }

                     if ( c == EOF )
                         {
                         error( "EOF in comment" );
                         break;
                         }
                     }
                 }
|
|||
|
|
|||
|
(Note that if the scanner is compiled using `C++', then `input()'
|
|||
|
is instead referred to as `yyinput()', in order to avoid a name
|
|||
|
clash with the `C++' stream by the name of `input'.)
|
|||
|
|
|||
|
- YY_FLUSH_BUFFER flushes the scanner's internal buffer so that the
|
|||
|
next time the scanner attempts to match a token, it will first
|
|||
|
refill the buffer using `YY_INPUT' (see The Generated Scanner,
|
|||
|
below). This action is a special case of the more general
|
|||
|
`yy_flush_buffer()' function, described below in the section
|
|||
|
Multiple Input Buffers.
|
|||
|
|
|||
|
- `yyterminate()' can be used in lieu of a return statement in an
|
|||
|
action. It terminates the scanner and returns a 0 to the
|
|||
|
scanner's caller, indicating "all done". By default,
|
|||
|
`yyterminate()' is also called when an end-of-file is encountered.
|
|||
|
It is a macro and may be redefined.
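
The back-to-front rule for `unput()' described above can be tried out
without a scanner.  The following is only a toy model, assuming a small
fixed-size pushback buffer; `toy_unput', `toy_input' and
`push_parenthesized' are hypothetical names and not part of `flex':

```c
#include <stddef.h>
#include <string.h>

/* A toy pushback buffer standing in for the scanner's input stream.
 * All names here are hypothetical; this is not flex's internal code. */
#define PUSHBACK_MAX 64

static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Like unput(c): c becomes the next character to be read. */
int toy_unput(char c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;              /* no room left */
    pushback[pushback_len++] = c;
    return 0;
}

/* Like input(): returns the most recently pushed-back character,
 * or -1 when nothing has been pushed back. */
int toy_input(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char) pushback[--pushback_len];
}

/* Push "(" token ")" back so that it is reread left to right.
 * The closing parenthesis goes first because it must be read last. */
int push_parenthesized(const char *token)
{
    size_t i = strlen(token);

    if (toy_unput(')') != 0)
        return -1;
    while (i > 0)
        if (toy_unput(token[--i]) != 0)
            return -1;
    return toy_unput('(');
}
```

Because each pushed-back character is read before the ones pushed
earlier, the token is walked from its last character to its first, just
as in the `unput()' action shown above.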
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Generated scanner, Next: Start conditions, Prev: Actions, Up: Top
|
|||
|
|
|||
|
The generated scanner
|
|||
|
=====================
|
|||
|
|
|||
|
The output of `flex' is the file `lex.yy.c', which contains the
|
|||
|
scanning routine `yylex()', a number of tables used by it for matching
|
|||
|
tokens, and a number of auxiliary routines and macros. By default,
|
|||
|
`yylex()' is declared as follows:
|
|||
|
|
|||
|
int yylex()
|
|||
|
{
|
|||
|
... various definitions and the actions in here ...
|
|||
|
}
|
|||
|
|
|||
|
(If your environment supports function prototypes, then it will be
|
|||
|
"int yylex( void )".) This definition may be changed by defining
|
|||
|
the "YY_DECL" macro. For example, you could use:
|
|||
|
|
|||
|
#define YY_DECL float lexscan( a, b ) float a, b;
|
|||
|
|
|||
|
to give the scanning routine the name `lexscan', returning a float,
|
|||
|
and taking two floats as arguments. Note that if you give arguments to
|
|||
|
the scanning routine using a K&R-style/non-prototyped function
|
|||
|
declaration, you must terminate the definition with a semi-colon (`;').
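
With a prototyped declaration no trailing semi-colon is involved.  The
sketch below assumes a compiler that accepts prototypes; the function
body is a placeholder standing in for the scanner body that `flex'
would place after `YY_DECL', not real scanner code:

```c
/* Prototyped equivalent of the K&R-style definition above. */
#define YY_DECL float lexscan(float a, float b)

/* The generated file places the scanner body after YY_DECL.
 * This body is a placeholder for illustration only. */
YY_DECL
{
    return a + b;
}
```

A caller would then invoke `lexscan( x, y )' wherever it would
otherwise have called `yylex()'.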
|
|||
|
|
|||
|
Whenever `yylex()' is called, it scans tokens from the global input
|
|||
|
file `yyin' (which defaults to stdin). It continues until it either
|
|||
|
reaches an end-of-file (at which point it returns the value 0) or one
|
|||
|
of its actions executes a `return' statement.
|
|||
|
|
|||
|
If the scanner reaches an end-of-file, subsequent calls are undefined
|
|||
|
unless either `yyin' is pointed at a new input file (in which case
|
|||
|
scanning continues from that file), or `yyrestart()' is called.
|
|||
|
`yyrestart()' takes one argument, a `FILE *' pointer (which can be nil,
|
|||
|
if you've set up `YY_INPUT' to scan from a source other than `yyin'),
|
|||
|
and initializes `yyin' for scanning from that file. Essentially there
|
|||
|
is no difference between just assigning `yyin' to a new input file or
|
|||
|
using `yyrestart()' to do so; the latter is available for compatibility
|
|||
|
with previous versions of `flex', and because it can be used to switch
|
|||
|
input files in the middle of scanning. It can also be used to throw
|
|||
|
away the current input buffer, by calling it with an argument of
|
|||
|
`yyin'; but it is better to use `YY_FLUSH_BUFFER' (see above). Note that
|
|||
|
`yyrestart()' does *not* reset the start condition to `INITIAL' (see
|
|||
|
Start Conditions, below).
|
|||
|
|
|||
|
If `yylex()' stops scanning due to executing a `return' statement in
|
|||
|
one of the actions, the scanner may then be called again and it will
|
|||
|
resume scanning where it left off.
|
|||
|
|
|||
|
By default (and for purposes of efficiency), the scanner uses
|
|||
|
block-reads rather than simple `getc()' calls to read characters from
|
|||
|
`yyin'. The nature of how it gets its input can be controlled by
|
|||
|
defining the `YY_INPUT' macro. YY_INPUT's calling sequence is
|
|||
|
"YY_INPUT(buf,result,max_size)". Its action is to place up to MAX_SIZE
|
|||
|
characters in the character array BUF and return in the integer
|
|||
|
variable RESULT either the number of characters read or the constant
|
|||
|
YY_NULL (0 on Unix systems) to indicate EOF. The default YY_INPUT
|
|||
|
reads from the global file-pointer "yyin".
|
|||
|
|
|||
|
A sample definition of YY_INPUT (in the definitions section of the
|
|||
|
input file):
|
|||
|
|
|||
|
     %{
     #define YY_INPUT(buf,result,max_size) \
         { \
         int c = getchar(); \
         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
         }
     %}
|
|||
|
|
|||
|
This definition will change the input processing to occur one
|
|||
|
character at a time.
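
The same calling sequence can deliver more than one character per call.
The following is a sketch only, assuming the input lives in a
NUL-terminated string; `set_source', `MEMORY_INPUT' and `read_chunk'
are hypothetical names, and `YY_NULL' is defined locally just so the
fragment compiles outside a scanner:

```c
#include <stddef.h>
#include <string.h>

#define YY_NULL 0   /* defined here only so the sketch compiles alone */

/* Hypothetical in-memory source; not part of flex. */
static const char *src_text = "";
static size_t src_pos = 0;

void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/* Same calling sequence as YY_INPUT(buf,result,max_size): copy up to
 * max_size characters into buf and store the count (or YY_NULL at the
 * end of the input) in result. */
#define MEMORY_INPUT(buf, result, max_size) \
    do { \
        size_t left_ = strlen(src_text) - src_pos; \
        size_t n_ = left_ < (size_t) (max_size) ? left_ : (size_t) (max_size); \
        memcpy((buf), src_text + src_pos, n_); \
        src_pos += n_; \
        (result) = n_ ? (int) n_ : YY_NULL; \
    } while (0)

/* Function wrapper so the macro can be exercised outside a scanner. */
int read_chunk(char *buf, int max_size)
{
    int result;

    MEMORY_INPUT(buf, result, max_size);
    return result;
}
```

Filling the buffer in larger pieces keeps the benefit of block reads
while still taking the input from somewhere other than `yyin'.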
|
|||
|
|
|||
|
When the scanner receives an end-of-file indication from YY_INPUT,
|
|||
|
it then checks the `yywrap()' function. If `yywrap()' returns false
|
|||
|
(zero), then it is assumed that the function has gone ahead and set up
|
|||
|
`yyin' to point to another input file, and scanning continues. If it
|
|||
|
returns true (non-zero), then the scanner terminates, returning 0 to
|
|||
|
its caller. Note that in either case, the start condition remains
|
|||
|
unchanged; it does *not* revert to `INITIAL'.
|
|||
|
|
|||
|
If you do not supply your own version of `yywrap()', then you must
|
|||
|
either use `%option noyywrap' (in which case the scanner behaves as
|
|||
|
though `yywrap()' returned 1), or you must link with `-lfl' to obtain
|
|||
|
the default version of the routine, which always returns 1.
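
A user-supplied `yywrap()' that moves on to the next file might look
roughly like this.  It is a sketch under assumptions: `remaining_files'
is a hypothetical NULL-terminated list of file names, and `yyin' is
declared here only so that the fragment compiles outside a generated
scanner:

```c
#include <stdio.h>

/* Normally provided by the generated scanner; declared here so the
 * sketch compiles on its own. */
FILE *yyin;

/* Hypothetical NULL-terminated list of remaining input file names. */
const char **remaining_files = NULL;

/* Returns 0 after pointing yyin at the next file that can be opened,
 * or 1 when there is nothing left to scan. */
int yywrap(void)
{
    while (remaining_files != NULL && *remaining_files != NULL) {
        FILE *next = fopen(*remaining_files++, "r");

        if (next != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = next;
            return 0;
        }
        /* could not open: skip this name and try the next one */
    }
    return 1;
}
```

Skipping unreadable files silently is a choice made for brevity here; a
real scanner would probably report them.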
|
|||
|
|
|||
|
Three routines are available for scanning from in-memory buffers
|
|||
|
rather than files: `yy_scan_string()', `yy_scan_bytes()', and
|
|||
|
`yy_scan_buffer()'. See the discussion of them below in the section
|
|||
|
Multiple Input Buffers.
|
|||
|
|
|||
|
The scanner writes its `ECHO' output to the `yyout' global (default,
|
|||
|
stdout), which may be redefined by the user simply by assigning it to
|
|||
|
some other `FILE' pointer.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Start conditions, Next: Multiple buffers, Prev: Generated scanner, Up: Top
|
|||
|
|
|||
|
Start conditions
|
|||
|
================
|
|||
|
|
|||
|
`flex' provides a mechanism for conditionally activating rules. Any
|
|||
|
rule whose pattern is prefixed with "<sc>" will only be active when the
|
|||
|
scanner is in the start condition named "sc". For example,
|
|||
|
|
|||
|
<STRING>[^"]* { /* eat up the string body ... */
|
|||
|
...
|
|||
|
}
|
|||
|
|
|||
|
will be active only when the scanner is in the "STRING" start
|
|||
|
condition, and
|
|||
|
|
|||
|
<INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
|
|||
|
...
|
|||
|
}
|
|||
|
|
|||
|
will be active only when the current start condition is either
|
|||
|
"INITIAL", "STRING", or "QUOTE".
|
|||
|
|
|||
|
Start conditions are declared in the definitions (first) section of
|
|||
|
the input using unindented lines beginning with either `%s' or `%x'
|
|||
|
followed by a list of names. The former declares *inclusive* start
|
|||
|
conditions, the latter *exclusive* start conditions. A start condition
|
|||
|
is activated using the `BEGIN' action. Until the next `BEGIN' action is
|
|||
|
executed, rules with the given start condition will be active and rules
|
|||
|
with other start conditions will be inactive. If the start condition
|
|||
|
is *inclusive*, then rules with no start conditions at all will also be
|
|||
|
active. If it is *exclusive*, then *only* rules qualified with the
|
|||
|
start condition will be active. A set of rules contingent on the same
|
|||
|
exclusive start condition describe a scanner which is independent of
|
|||
|
any of the other rules in the `flex' input. Because of this, exclusive
|
|||
|
start conditions make it easy to specify "mini-scanners" which scan
|
|||
|
portions of the input that are syntactically different from the rest
|
|||
|
(e.g., comments).
|
|||
|
|
|||
|
If the distinction between inclusive and exclusive start conditions
|
|||
|
is still a little vague, here's a simple example illustrating the
|
|||
|
connection between the two. The set of rules:
|
|||
|
|
|||
|
%s example
|
|||
|
%%
|
|||
|
|
|||
|
<example>foo do_something();
|
|||
|
|
|||
|
bar something_else();
|
|||
|
|
|||
|
is equivalent to
|
|||
|
|
|||
|
%x example
|
|||
|
%%
|
|||
|
|
|||
|
<example>foo do_something();
|
|||
|
|
|||
|
<INITIAL,example>bar something_else();
|
|||
|
|
|||
|
Without the `<INITIAL,example>' qualifier, the `bar' pattern in the
|
|||
|
second example wouldn't be active (i.e., couldn't match) when in start
|
|||
|
condition `example'. If we just used `<example>' to qualify `bar',
|
|||
|
though, then it would only be active in `example' and not in `INITIAL',
|
|||
|
while in the first example it's active in both, because in the first
|
|||
|
example the `example' starting condition is an *inclusive* (`%s') start
|
|||
|
condition.
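
The activation rule can be written down as a small predicate.  This is
a toy model and not how `flex' implements start conditions;
`rule_is_active' and the bit-mask encoding are assumptions made for
illustration:

```c
/* Toy model of which rules are active; not flex's implementation.
 * Start conditions are numbered from 0, with 0 standing for INITIAL. */
enum sc_kind { SC_INCLUSIVE, SC_EXCLUSIVE };

/* rule_prefix is a bit mask of the start conditions named in the
 * rule's <...> prefix (bit i for condition i); 0 means no prefix.
 * current is the active start condition and kind says how it was
 * declared (%s or %x; INITIAL behaves like an inclusive condition). */
int rule_is_active(unsigned rule_prefix, int current, enum sc_kind kind)
{
    if (rule_prefix != 0)
        return (rule_prefix >> current) & 1u;
    return kind == SC_INCLUSIVE;
}
```

In this model the unprefixed `bar' rule is active in `example' when the
condition is declared with `%s' but not when it is declared with `%x',
which is the difference the two examples above illustrate.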
|
|||
|
|
|||
|
Also note that the special start-condition specifier `<*>' matches
|
|||
|
every start condition. Thus, the above example could also have been
|
|||
|
written:
|
|||
|
|
|||
|
%x example
|
|||
|
%%
|
|||
|
|
|||
|
<example>foo do_something();
|
|||
|
|
|||
|
<*>bar something_else();
|
|||
|
|
|||
|
The default rule (to `ECHO' any unmatched character) remains active
|
|||
|
in start conditions. It is equivalent to:
|
|||
|
|
|||
|
     <*>.|\n     ECHO;
|
|||
|
|
|||
|
`BEGIN(0)' returns to the original state where only the rules with
|
|||
|
no start conditions are active. This state can also be referred to as
|
|||
|
the start-condition "INITIAL", so `BEGIN(INITIAL)' is equivalent to
|
|||
|
`BEGIN(0)'. (The parentheses around the start condition name are not
|
|||
|
required but are considered good style.)
|
|||
|
|
|||
|
`BEGIN' actions can also be given as indented code at the beginning
|
|||
|
of the rules section. For example, the following will cause the
|
|||
|
scanner to enter the "SPECIAL" start condition whenever `yylex()' is
|
|||
|
called and the global variable `enter_special' is true:
|
|||
|
|
|||
|
             int enter_special;

     %x SPECIAL
     %%
             if ( enter_special )
                 BEGIN(SPECIAL);

     <SPECIAL>blahblahblah
     ...more rules follow...
|
|||
|
|
|||
|
To illustrate the uses of start conditions, here is a scanner which
|
|||
|
provides two different interpretations of a string like "123.456". By
|
|||
|
default it will treat it as three tokens, the integer "123", a dot
|
|||
|
('.'), and the integer "456". But if the string is preceded earlier in
|
|||
|
the line by the string "expect-floats" it will treat it as a single
|
|||
|
token, the floating-point number 123.456:
|
|||
|
|
|||
|
%{
|
|||
|
#include <math.h>
|
|||
|
%}
|
|||
|
%s expect
|
|||
|
|
|||
|
%%
|
|||
|
expect-floats BEGIN(expect);
|
|||
|
|
|||
|
<expect>[0-9]+"."[0-9]+ {
|
|||
|
printf( "found a float, = %f\n",
|
|||
|
atof( yytext ) );
|
|||
|
}
|
|||
|
<expect>\n {
|
|||
|
/* that's the end of the line, so
|
|||
|
* we need another "expect-floats"
|
|||
|
* before we'll recognize any more
|
|||
|
* numbers
|
|||
|
*/
|
|||
|
BEGIN(INITIAL);
|
|||
|
}
|
|||
|
|
|||
|
[0-9]+ {
|
|||
|
|
|||
|
printf( "found an integer, = %d\n",
|
|||
|
atoi( yytext ) );
|
|||
|
}
|
|||
|
|
|||
|
"." printf( "found a dot\n" );
|
|||
|
|
|||
|
Here is a scanner which recognizes (and discards) C comments while
|
|||
|
maintaining a count of the current input line.
|
|||
|
|
|||
|
     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*        /* eat anything that's not a '*' */
     <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);
|
|||
|
|
|||
|
This scanner goes to a bit of trouble to match as much text as
|
|||
|
possible with each rule. In general, when attempting to write a
|
|||
|
high-speed scanner try to match as much as possible in each rule, as it's
|
|||
|
a big win.
|
|||
|
|
|||
|
Note that start condition names are really integer values and can
|
|||
|
be stored as such. Thus, the above could be extended in the following
|
|||
|
fashion:
|
|||
|
|
|||
|
%x comment foo
|
|||
|
%%
|
|||
|
int line_num = 1;
|
|||
|
int comment_caller;
|
|||
|
|
|||
|
"/*" {
|
|||
|
comment_caller = INITIAL;
|
|||
|
BEGIN(comment);
|
|||
|
}
|
|||
|
|
|||
|
...
|
|||
|
|
|||
|
<foo>"/*" {
|
|||
|
comment_caller = foo;
|
|||
|
BEGIN(comment);
|
|||
|
}
|
|||
|
|
|||
|
<comment>[^*\n]* /* eat anything that's not a '*' */
|
|||
|
<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
|
|||
|
<comment>\n ++line_num;
|
|||
|
<comment>"*"+"/" BEGIN(comment_caller);
|
|||
|
|
|||
|
Furthermore, you can access the current start condition using the
|
|||
|
integer-valued `YY_START' macro. For example, the above assignments to
|
|||
|
`comment_caller' could instead be written
|
|||
|
|
|||
|
comment_caller = YY_START;
|
|||
|
|
|||
|
Flex provides `YYSTATE' as an alias for `YY_START' (since that is
|
|||
|
what's used by AT&T `lex').
|
|||
|
|
|||
|
Note that start conditions do not have their own name-space; %s's
|
|||
|
and %x's declare names in the same fashion as #define's.
|
|||
|
|
|||
|
Finally, here's an example of how to match C-style quoted strings
|
|||
|
using exclusive start conditions, including expanded escape sequences
|
|||
|
(but not including checking for a string that's too long):
|
|||
|
|
|||
|
%x str
|
|||
|
|
|||
|
%%
|
|||
|
char string_buf[MAX_STR_CONST];
|
|||
|
char *string_buf_ptr;
|
|||
|
|
|||
|
\" string_buf_ptr = string_buf; BEGIN(str);
|
|||
|
|
|||
|
<str>\" { /* saw closing quote - all done */
|
|||
|
BEGIN(INITIAL);
|
|||
|
*string_buf_ptr = '\0';
|
|||
|
/* return string constant token type and
|
|||
|
* value to parser
|
|||
|
*/
|
|||
|
}
|
|||
|
|
|||
|
<str>\n {
|
|||
|
/* error - unterminated string constant */
|
|||
|
/* generate error message */
|
|||
|
}
|
|||
|
|
|||
|
     <str>\\[0-7]{1,3} {
             /* octal escape sequence */
             unsigned int result;    /* %o expects an unsigned int */

             (void) sscanf( yytext + 1, "%o", &result );

             if ( result > 0xff )
                     {
                     /* error, constant is out-of-bounds */
                     }

             *string_buf_ptr++ = result;
             }
|
|||
|
|
|||
|
<str>\\[0-9]+ {
|
|||
|
/* generate error - bad escape sequence; something
|
|||
|
* like '\48' or '\0777777'
|
|||
|
*/
|
|||
|
}
|
|||
|
|
|||
|
<str>\\n *string_buf_ptr++ = '\n';
|
|||
|
<str>\\t *string_buf_ptr++ = '\t';
|
|||
|
<str>\\r *string_buf_ptr++ = '\r';
|
|||
|
<str>\\b *string_buf_ptr++ = '\b';
|
|||
|
<str>\\f *string_buf_ptr++ = '\f';
|
|||
|
|
|||
|
<str>\\(.|\n) *string_buf_ptr++ = yytext[1];
|
|||
|
|
|||
|
<str>[^\\\n\"]+ {
|
|||
|
char *yptr = yytext;
|
|||
|
|
|||
|
while ( *yptr )
|
|||
|
*string_buf_ptr++ = *yptr++;
|
|||
|
}
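
The bounds check in the octal-escape rule above can be pulled out into
a helper that is easy to test separately.  This is a sketch;
`decode_octal_escape' is a hypothetical name, and rejecting an
out-of-range value (rather than storing it anyway) is a design choice
made here, not something the example above prescribes:

```c
#include <stdio.h>

/* Decodes an escape such as "\101" (a backslash plus one to three
 * octal digits), i.e. the text the octal-escape rule sees in yytext.
 * Returns 0 and stores the byte in *out, or -1 if the text is
 * malformed or the value does not fit in one byte. */
int decode_octal_escape(const char *text, unsigned char *out)
{
    unsigned int value;

    if (text[0] != '\\' || sscanf(text + 1, "%3o", &value) != 1)
        return -1;
    if (value > 0xff)
        return -1;      /* e.g. "\777" is 511, too large for a char */
    *out = (unsigned char) value;
    return 0;
}
```

Inside the rule's action the call would take `yytext' as its first
argument, and the result would be appended to `string_buf' only when
the helper returns 0.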
|
|||
|
|
|||
|
Often, such as in some of the examples above, you wind up writing a
|
|||
|
whole bunch of rules all preceded by the same start condition(s). Flex
|
|||
|
makes this a little easier and cleaner by introducing a notion of start
|
|||
|
condition "scope". A start condition scope is begun with:
|
|||
|
|
|||
|
<SCs>{
|
|||
|
|
|||
|
where SCs is a list of one or more start conditions. Inside the start
|
|||
|
condition scope, every rule automatically has the prefix `<SCs>'
|
|||
|
applied to it, until a `}' which matches the initial `{'. So, for
|
|||
|
example,
|
|||
|
|
|||
|
<ESC>{
|
|||
|
"\\n" return '\n';
|
|||
|
"\\r" return '\r';
|
|||
|
"\\f" return '\f';
|
|||
|
"\\0" return '\0';
|
|||
|
}
|
|||
|
|
|||
|
is equivalent to:
|
|||
|
|
|||
|
<ESC>"\\n" return '\n';
|
|||
|
<ESC>"\\r" return '\r';
|
|||
|
<ESC>"\\f" return '\f';
|
|||
|
<ESC>"\\0" return '\0';
|
|||
|
|
|||
|
Start condition scopes may be nested.
|
|||
|
|
|||
|
Three routines are available for manipulating stacks of start
|
|||
|
conditions:
|
|||
|
|
|||
|
`void yy_push_state(int new_state)'
|
|||
|
pushes the current start condition onto the top of the start
|
|||
|
condition stack and switches to NEW_STATE as though you had used
|
|||
|
`BEGIN new_state' (recall that start condition names are also
|
|||
|
integers).
|
|||
|
|
|||
|
`void yy_pop_state()'
|
|||
|
pops the top of the stack and switches to it via `BEGIN'.
|
|||
|
|
|||
|
`int yy_top_state()'
|
|||
|
returns the top of the stack without altering the stack's contents.
|
|||
|
|
|||
|
The start condition stack grows dynamically and so has no built-in
|
|||
|
size limitation. If memory is exhausted, program execution aborts.
|
|||
|
|
|||
|
To use start condition stacks, your scanner must include a `%option
|
|||
|
stack' directive (see Options below).
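
The behaviour of the three routines can be modelled in a few lines of
C.  This is a toy model only, with hypothetical `toy_' names; it is not
the code `flex' generates, and it simply aborts on errors:

```c
#include <stdlib.h>

/* Toy model of a growable start-condition stack; all names are
 * hypothetical and this is not the code flex generates. */
static int current_state = 0;      /* 0 plays the role of INITIAL */
static int *stack = NULL;
static size_t depth = 0, capacity = 0;

/* Like yy_push_state(): remember the current state, then switch. */
void toy_push_state(int new_state)
{
    if (depth == capacity) {
        size_t new_cap = capacity ? capacity * 2 : 8;
        int *grown = realloc(stack, new_cap * sizeof *grown);

        if (grown == NULL)
            abort();                /* out of memory: give up */
        stack = grown;
        capacity = new_cap;
    }
    stack[depth++] = current_state;
    current_state = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved state. */
void toy_pop_state(void)
{
    if (depth == 0)
        abort();                    /* nothing to pop */
    current_state = stack[--depth];
}

/* Like yy_top_state(): peek at the saved state without popping. */
int toy_top_state(void)
{
    if (depth == 0)
        abort();
    return stack[depth - 1];
}

/* Plays the role of YY_START in this model. */
int toy_current_state(void)
{
    return current_state;
}
```

Note that the value on top of the stack is the state that was current
before the most recent push, not the state the scanner is in now.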
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Multiple buffers, Next: End-of-file rules, Prev: Start conditions, Up: Top
|
|||
|
|
|||
|
Multiple input buffers
|
|||
|
======================
|
|||
|
|
|||
|
Some scanners (such as those which support "include" files) require
|
|||
|
reading from several input streams. As `flex' scanners do a large
|
|||
|
amount of buffering, one cannot control where the next input will be
|
|||
|
read from by simply writing a `YY_INPUT' which is sensitive to the
|
|||
|
scanning context. `YY_INPUT' is only called when the scanner reaches
|
|||
|
the end of its buffer, which may be a long time after scanning a
|
|||
|
statement such as an "include" which requires switching the input
|
|||
|
source.
|
|||
|
|
|||
|
To negotiate these sorts of problems, `flex' provides a mechanism
|
|||
|
for creating and switching between multiple input buffers. An input
|
|||
|
buffer is created by using:
|
|||
|
|
|||
|
YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
|
|||
|
|
|||
|
which takes a `FILE' pointer and a size and creates a buffer associated
|
|||
|
with the given file and large enough to hold SIZE characters (when in
|
|||
|
doubt, use `YY_BUF_SIZE' for the size). It returns a `YY_BUFFER_STATE'
|
|||
|
handle, which may then be passed to other routines (see below). The
|
|||
|
`YY_BUFFER_STATE' type is a pointer to an opaque `struct'
|
|||
|
`yy_buffer_state' structure, so you may safely initialize
|
|||
|
YY_BUFFER_STATE variables to `((YY_BUFFER_STATE) 0)' if you wish, and
|
|||
|
also refer to the opaque structure in order to correctly declare input
|
|||
|
buffers in source files other than that of your scanner. Note that the
|
|||
|
`FILE' pointer in the call to `yy_create_buffer' is only used as the
|
|||
|
value of `yyin' seen by `YY_INPUT'; if you redefine `YY_INPUT' so it no
|
|||
|
longer uses `yyin', then you can safely pass a nil `FILE' pointer to
|
|||
|
`yy_create_buffer'. You select a particular buffer to scan from using:
|
|||
|
|
|||
|
void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
|
|||
|
|
|||
|
It switches the scanner's input buffer so subsequent tokens will come
|
|||
|
from NEW_BUFFER. Note that `yy_switch_to_buffer()' may be used by
|
|||
|
`yywrap()' to set things up for continued scanning, instead of opening
|
|||
|
a new file and pointing `yyin' at it. Note also that switching input
|
|||
|
sources via either `yy_switch_to_buffer()' or `yywrap()' does *not*
|
|||
|
change the start condition.
|
|||
|
|
|||
|
void yy_delete_buffer( YY_BUFFER_STATE buffer )
|
|||
|
|
|||
|
is used to reclaim the storage associated with a buffer. You can also
|
|||
|
clear the current contents of a buffer using:
|
|||
|
|
|||
|
void yy_flush_buffer( YY_BUFFER_STATE buffer )
|
|||
|
|
|||
|
This function discards the buffer's contents, so the next time the
|
|||
|
scanner attempts to match a token from the buffer, it will first fill
|
|||
|
the buffer anew using `YY_INPUT'.
|
|||
|
|
|||
|
`yy_new_buffer()' is an alias for `yy_create_buffer()', provided for
|
|||
|
compatibility with the C++ use of `new' and `delete' for creating and
|
|||
|
destroying dynamic objects.
|
|||
|
|
|||
|
Finally, the `YY_CURRENT_BUFFER' macro returns a `YY_BUFFER_STATE'
|
|||
|
handle to the current buffer.
|
|||
|
|
|||
|
Here is an example of using these features for writing a scanner
|
|||
|
which expands include files (the `<<EOF>>' feature is discussed below):
|
|||
|
|
|||
|
     /* the "incl" state is used for picking up the name
      * of an include file
      */
     %x incl

     %{
     #define MAX_INCLUDE_DEPTH 10
     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
     int include_stack_ptr = 0;
     %}

     %%
     include             BEGIN(incl);

     [a-z]+              ECHO;
     [^a-z\n]*\n?        ECHO;

     <incl>[ \t]*      /* eat the whitespace */
     <incl>[^ \t\n]+   { /* got the include file name */
             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                 {
                 fprintf( stderr, "Includes nested too deeply" );
                 exit( 1 );
                 }

             include_stack[include_stack_ptr++] =
                 YY_CURRENT_BUFFER;

             yyin = fopen( yytext, "r" );

             if ( ! yyin )
                 error( ... );

             yy_switch_to_buffer(
                 yy_create_buffer( yyin, YY_BUF_SIZE ) );

             BEGIN(INITIAL);
             }

     <<EOF>> {
             if ( --include_stack_ptr < 0 )
                 {
                 yyterminate();
                 }

             else
                 {
                 yy_delete_buffer( YY_CURRENT_BUFFER );
                 yy_switch_to_buffer(
                      include_stack[include_stack_ptr] );
                 }
             }
|
|||
|
|
|||
|
Three routines are available for setting up input buffers for
|
|||
|
scanning in-memory strings instead of files. All of them create a new
|
|||
|
input buffer for scanning the string, and return a corresponding
|
|||
|
`YY_BUFFER_STATE' handle (which you should delete with
|
|||
|
`yy_delete_buffer()' when done with it). They also switch to the new
|
|||
|
buffer using `yy_switch_to_buffer()', so the next call to `yylex()' will
|
|||
|
start scanning the string.
|
|||
|
|
|||
|
`yy_scan_string(const char *str)'
|
|||
|
scans a NUL-terminated string.
|
|||
|
|
|||
|
`yy_scan_bytes(const char *bytes, int len)'
|
|||
|
scans `len' bytes (including possibly NUL's) starting at location
|
|||
|
BYTES.
|
|||
|
|
|||
|
Note that both of these functions create and scan a *copy* of the
|
|||
|
string or bytes. (This may be desirable, since `yylex()' modifies the
|
|||
|
contents of the buffer it is scanning.) You can avoid the copy by using:
|
|||
|
|
|||
|
`yy_scan_buffer(char *base, yy_size_t size)'
|
|||
|
which scans in place the buffer starting at BASE, consisting of
|
|||
|
SIZE bytes, the last two bytes of which *must* be
|
|||
|
`YY_END_OF_BUFFER_CHAR' (ASCII NUL). These last two bytes are not
|
|||
|
scanned; thus, scanning consists of `base[0]' through
|
|||
|
`base[size-3]', inclusive.
|
|||
|
|
|||
|
If you fail to set up BASE in this manner (i.e., forget the final
|
|||
|
two `YY_END_OF_BUFFER_CHAR' bytes), then `yy_scan_buffer()'
|
|||
|
returns a nil pointer instead of creating a new input buffer.
|
|||
|
|
|||
|
The type `yy_size_t' is an integral type to which you can cast an
|
|||
|
integer expression reflecting the size of the buffer.
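
Preparing a buffer in the layout `yy_scan_buffer()' expects can be
sketched as follows.  `make_scan_buffer' is a hypothetical helper, not
a `flex' routine, and it assumes the text to scan is itself a
NUL-terminated string:

```c
#include <stdlib.h>
#include <string.h>

/* Copies text into a freshly allocated buffer that ends with the two
 * NUL bytes yy_scan_buffer() requires.  On success *size_out receives
 * the total size including both terminators.  The caller frees the
 * result.  Hypothetical helper, not part of flex. */
char *make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len = strlen(text);
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = '\0';       /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = '\0';   /* second YY_END_OF_BUFFER_CHAR */
    *size_out = len + 2;
    return base;
}
```

Inside a scanner the result could then be passed on as
`yy_scan_buffer( base, size )'; the memory must stay valid until the
returned buffer has been deleted, since it is scanned in place.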
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: End-of-file rules, Next: Miscellaneous, Prev: Multiple buffers, Up: Top
|
|||
|
|
|||
|
End-of-file rules
|
|||
|
=================
|
|||
|
|
|||
|
The special rule "<<EOF>>" indicates actions which are to be taken
|
|||
|
when an end-of-file is encountered and yywrap() returns non-zero (i.e.,
|
|||
|
indicates no further files to process). The action must finish by
|
|||
|
doing one of four things:
|
|||
|
|
|||
|
- assigning `yyin' to a new input file (in previous versions of
|
|||
|
flex, after doing the assignment you had to call the special
|
|||
|
action `YY_NEW_FILE'; this is no longer necessary);
|
|||
|
|
|||
|
- executing a `return' statement;
|
|||
|
|
|||
|
- executing the special `yyterminate()' action;
|
|||
|
|
|||
|
- or, switching to a new buffer using `yy_switch_to_buffer()' as
|
|||
|
shown in the example above.
|
|||
|
|
|||
|
<<EOF>> rules may not be used with other patterns; they may only be
|
|||
|
qualified with a list of start conditions. If an unqualified <<EOF>>
|
|||
|
rule is given, it applies to *all* start conditions which do not
|
|||
|
already have <<EOF>> actions. To specify an <<EOF>> rule for only the
|
|||
|
initial start condition, use
|
|||
|
|
|||
|
<INITIAL><<EOF>>
|
|||
|
|
|||
|
These rules are useful for catching things like unclosed comments.
|
|||
|
An example:
|
|||
|
|
|||
|
%x quote
|
|||
|
%%
|
|||
|
|
|||
|
...other rules for dealing with quotes...
|
|||
|
|
|||
|
<quote><<EOF>> {
|
|||
|
error( "unterminated quote" );
|
|||
|
yyterminate();
|
|||
|
}
|
|||
|
<<EOF>> {
|
|||
|
if ( *++filelist )
|
|||
|
yyin = fopen( *filelist, "r" );
|
|||
|
else
|
|||
|
yyterminate();
|
|||
|
}
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Miscellaneous, Next: User variables, Prev: End-of-file rules, Up: Top
|
|||
|
|
|||
|
Miscellaneous macros
|
|||
|
====================
|
|||
|
|
|||
|
The macro `YY_USER_ACTION' can be defined to provide an action which
|
|||
|
is always executed prior to the matched rule's action. For example, it
|
|||
|
could be #define'd to call a routine to convert yytext to lower-case.
|
|||
|
When `YY_USER_ACTION' is invoked, the variable `yy_act' gives the
|
|||
|
number of the matched rule (rules are numbered starting with 1).
|
|||
|
Suppose you want to profile how often each of your rules is matched.
|
|||
|
The following would do the trick:
|
|||
|
|
|||
|
#define YY_USER_ACTION ++ctr[yy_act]
|
|||
|
|
|||
|
where `ctr' is an array to hold the counts for the different rules.
|
|||
|
Note that the macro `YY_NUM_RULES' gives the total number of rules
|
|||
|
(including the default rule, even if you use `-s'), so a correct
|
|||
|
declaration for `ctr' is:
|
|||
|
|
|||
|
int ctr[YY_NUM_RULES];
|
|||
|
|
|||
|
The macro `YY_USER_INIT' may be defined to provide an action which
|
|||
|
is always executed before the first scan (and before the scanner's
|
|||
|
internal initializations are done). For example, it could be used to
|
|||
|
call a routine to read in a data table or open a logging file.
|
|||
|
|
|||
|
The macro `yy_set_interactive(is_interactive)' can be used to
|
|||
|
control whether the current buffer is considered *interactive*. An
|
|||
|
interactive buffer is processed more slowly, but must be used when the
|
|||
|
scanner's input source is indeed interactive to avoid problems due to
|
|||
|
waiting to fill buffers (see the discussion of the `-I' flag below). A
|
|||
|
non-zero value in the macro invocation marks the buffer as interactive,
|
|||
|
a zero value as non-interactive. Note that use of this macro overrides
|
|||
|
`%option always-interactive' or `%option never-interactive' (see
|
|||
|
Options below). `yy_set_interactive()' must be invoked prior to
|
|||
|
beginning to scan the buffer that is (or is not) to be considered
|
|||
|
interactive.
|
|||
|
|
|||
|
The macro `yy_set_bol(at_bol)' can be used to control whether the
|
|||
|
current buffer's scanning context for the next token match is done as
|
|||
|
though at the beginning of a line. A non-zero macro argument makes
|
|||
|
rules anchored with '^' active, while a zero argument makes '^' rules
inactive.
|
|||
|
|
|||
|
The macro `YY_AT_BOL()' returns true if the next token scanned from
|
|||
|
the current buffer will have '^' rules active, false otherwise.
|
|||
|
|
|||
|
In the generated scanner, the actions are all gathered in one large
|
|||
|
switch statement and separated using `YY_BREAK', which may be
|
|||
|
redefined. By default, it is simply a "break", to separate each rule's
|
|||
|
action from the following rule's. Redefining `YY_BREAK' allows, for
|
|||
|
example, C++ users to #define YY_BREAK to do nothing (while being very
|
|||
|
careful that every rule ends with a "break" or a "return"!) to avoid
|
|||
|
suffering from unreachable-statement warnings: when a rule's action ends
with "return", the `YY_BREAK' that follows it can never be reached.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: User variables, Next: YACC interface, Prev: Miscellaneous, Up: Top
|
|||
|
|
|||
|
Values available to the user
|
|||
|
============================
|
|||
|
|
|||
|
This section summarizes the various values available to the user in
|
|||
|
the rule actions.
|
|||
|
|
|||
|
- `char *yytext' holds the text of the current token. It may be
|
|||
|
modified but not lengthened (you cannot append characters to the
|
|||
|
end).
|
|||
|
|
|||
|
If the special directive `%array' appears in the first section of
|
|||
|
the scanner description, then `yytext' is instead declared `char
|
|||
|
yytext[YYLMAX]', where `YYLMAX' is a macro definition that you can
|
|||
|
redefine in the first section if you don't like the default value
|
|||
|
(generally 8KB). Using `%array' results in somewhat slower
|
|||
|
scanners, but the value of `yytext' becomes immune to calls to
|
|||
|
`input()' and `unput()', which potentially destroy its value when
|
|||
|
`yytext' is a character pointer. The opposite of `%array' is
|
|||
|
`%pointer', which is the default.
|
|||
|
|
|||
|
You cannot use `%array' when generating C++ scanner classes (the
|
|||
|
`-+' flag).
|
|||
|
|
|||
|
- `int yyleng' holds the length of the current token.
|
|||
|
|
|||
|
- `FILE *yyin' is the file which by default `flex' reads from. It
|
|||
|
may be redefined but doing so only makes sense before scanning
|
|||
|
begins or after an EOF has been encountered. Changing it in the
|
|||
|
midst of scanning will have unexpected results since `flex'
|
|||
|
buffers its input; use `yyrestart()' instead. Once scanning
|
|||
|
terminates because an end-of-file has been seen, you can point
`yyin' at the new input file and then call the scanner again to
|
|||
|
continue scanning.
|
|||
|
|
|||
|
- `void yyrestart( FILE *new_file )' may be called to point `yyin'
|
|||
|
at the new input file. The switch-over to the new file is
|
|||
|
immediate (any previously buffered-up input is lost). Note that
|
|||
|
calling `yyrestart()' with `yyin' as an argument thus throws away
|
|||
|
the current input buffer and continues scanning the same input
|
|||
|
file.
|
|||
|
|
|||
|
- `FILE *yyout' is the file to which `ECHO' actions are done. It
|
|||
|
can be reassigned by the user.
|
|||
|
|
|||
|
- `YY_CURRENT_BUFFER' returns a `YY_BUFFER_STATE' handle to the
|
|||
|
current buffer.
|
|||
|
|
|||
|
- `YY_START' returns an integer value corresponding to the current
|
|||
|
start condition. You can subsequently use this value with `BEGIN'
|
|||
|
to return to that start condition.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: YACC interface, Next: Options, Prev: User variables, Up: Top

Interfacing with `yacc'
=======================

One of the main uses of `flex' is as a companion to the `yacc'
parser-generator. `yacc' parsers expect to call a routine named
`yylex()' to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global `yylval'. To use `flex' with `yacc', one specifies
the `-d' option to `yacc' to instruct it to generate the file `y.tab.h'
containing definitions of all the `%tokens' appearing in the `yacc'
input. This file is then included in the `flex' scanner. For example,
if one of the tokens is "TOK_NUMBER", part of the scanner might look
like:

     %{
     #include <stdlib.h>   /* for atoi() */
     #include "y.tab.h"
     %}

     %%

     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

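The contract that `yacc' expects from `yylex()' can be sketched in
plain C, without `flex' at all. What follows is only an illustration
under stated assumptions: `demo_yylex()', `demo_yylval',
`demo_set_input()' and the token numbers are hypothetical stand-ins (a
real build takes the token numbers from `y.tab.h' and uses the
`yylex()' that `flex' generates). It shows the two halves of the
contract: the return value carries the token type, and the associated
value is left in a global.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the definitions `yacc -d' writes to y.tab.h. */
#define TOK_NUMBER 258
#define TOK_OTHER  259

static int demo_yylval;          /* plays the role of the global yylval */
static const char *demo_cursor;  /* next unread input character */

/* Point the demo scanner at a NUL-terminated input string. */
static void demo_set_input(const char *text)
{
    demo_cursor = text;
}

/* Return the type of the next token; 0 means end of input.
 * For TOK_NUMBER the numeric value is left in demo_yylval. */
static int demo_yylex(void)
{
    while (*demo_cursor == ' ' || *demo_cursor == '\n')
        ++demo_cursor;                    /* skip separators */

    if (*demo_cursor == '\0')
        return 0;

    if (isdigit((unsigned char) *demo_cursor)) {
        char *end;
        demo_yylval = (int) strtol(demo_cursor, &end, 10);
        demo_cursor = end;                /* continue after the digits */
        return TOK_NUMBER;
    }

    ++demo_cursor;                        /* any other single character */
    return TOK_OTHER;
}
```

A parser would call such a routine repeatedly until it returns 0,
reading the global after each `TOK_NUMBER'.
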
File: flex.info, Node: Options, Next: Performance, Prev: YACC interface, Up: Top

Options
=======

`flex' has the following options:

`-b'
     Generate backing-up information to `lex.backup'. This is a list
     of scanner states which require backing up and the input
     characters on which they do so. By adding rules one can remove
     backing-up states. If *all* backing-up states are eliminated and
     `-Cf' or `-CF' is used, the generated scanner will run faster (see
     the `-p' flag). Only users who wish to squeeze every last cycle
     out of their scanners need worry about this option. (See the
     section on Performance Considerations below.)

`-c'
     is a do-nothing, deprecated option included for POSIX compliance.

`-d'
     makes the generated scanner run in "debug" mode. Whenever a
     pattern is recognized and the global `yy_flex_debug' is non-zero
     (which is the default), the scanner will write to `stderr' a line
     of the form:

          --accepting rule at line 53 ("the matched text")

     The line number refers to the location of the rule in the file
     defining the scanner (i.e., the file that was fed to flex).
     Messages are also generated when the scanner backs up, accepts the
     default rule, reaches the end of its input buffer (or encounters a
     NUL; at this point, the two look the same as far as the scanner's
     concerned), or reaches an end-of-file.

`-f'
     specifies "fast scanner". No table compression is done and stdio
     is bypassed. The result is large but fast. This option is
     equivalent to `-Cfr' (see below).

`-h'
     generates a "help" summary of `flex's' options to `stdout' and
     then exits. `-?' and `--help' are synonyms for `-h'.

`-i'
     instructs `flex' to generate a *case-insensitive* scanner. The
     case of letters given in the `flex' input patterns will be
     ignored, and tokens in the input will be matched regardless of
     case. The matched text given in `yytext' keeps its original
     case (i.e., it will not be folded).

`-l'
     turns on maximum compatibility with the original AT&T `lex'
     implementation. Note that this does not mean *full*
     compatibility. Use of this option costs a considerable amount of
     performance, and it cannot be used with the `-+', `-f', `-F',
     `-Cf', or `-CF' options. For details on the compatibilities it
     provides, see the section "Incompatibilities With Lex And POSIX"
     below. This option also results in the name `YY_FLEX_LEX_COMPAT'
     being #define'd in the generated scanner.

`-n'
     is another do-nothing, deprecated option included only for POSIX
     compliance.

`-p'
     generates a performance report to stderr. The report consists of
     comments regarding features of the `flex' input file which will
     cause a serious loss of performance in the resulting scanner. If
     you give the flag twice, you will also get comments regarding
     features that lead to minor performance losses.

     Note that the use of `REJECT', `%option yylineno' and variable
     trailing context (see the Deficiencies / Bugs section below)
     entails a substantial performance penalty; use of `yymore()', the
     `^' operator, and the `-I' flag entail minor performance penalties.

`-s'
     causes the "default rule" (that unmatched scanner input is echoed
     to `stdout') to be suppressed. If the scanner encounters input
     that does not match any of its rules, it aborts with an error.
     This option is useful for finding holes in a scanner's rule set.

`-t'
     instructs `flex' to write the scanner it generates to standard
     output instead of `lex.yy.c'.

`-v'
     specifies that `flex' should write to `stderr' a summary of
     statistics regarding the scanner it generates. Most of the
     statistics are meaningless to the casual `flex' user, but the
     first line identifies the version of `flex' (same as reported by
     `-V'), and the next line the flags used when generating the
     scanner, including those that are on by default.

`-w'
     suppresses warning messages.

`-B'
     instructs `flex' to generate a *batch* scanner, the opposite of
     *interactive* scanners generated by `-I' (see below). In general,
     you use `-B' when you are *certain* that your scanner will never
     be used interactively, and you want to squeeze a *little* more
     performance out of it. If your goal is instead to squeeze out a
     *lot* more performance, you should be using the `-Cf' or `-CF'
     options (discussed below), which turn on `-B' automatically anyway.

`-F'
     specifies that the "fast" scanner table representation should be
     used (and stdio bypassed). This representation is about as fast
     as the full table representation (`-f'), and for some sets of
     patterns will be considerably smaller (and for others, larger).
     In general, if the pattern set contains both "keywords" and a
     catch-all, "identifier" rule, such as in the set:

          "case"    return TOK_CASE;
          "switch"  return TOK_SWITCH;
          ...
          "default" return TOK_DEFAULT;
          [a-z]+    return TOK_ID;

     then you're better off using the full table representation. If
     only the "identifier" rule is present and you then use a hash
     table or some such to detect the keywords, you're better off using
     `-F'.

     This option is equivalent to `-CFr' (see below). It cannot be
     used with `-+'.

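The second arrangement described under `-F' (a lone "identifier" rule
plus a separate keyword lookup) might look roughly like the following
C helper, which the action of the `[a-z]+' rule could call with
`yytext'. This is a sketch under assumptions: the text speaks of "a
hash table or some such", and here a sorted table searched with the
standard `bsearch()' is swapped in for brevity; `classify_identifier()',
`struct keyword' and the token codes are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would take them from y.tab.h. */
enum { TOK_ID = 1, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() requires sorted input. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

/* bsearch() callback: compare the search key with one table entry. */
static int compare_keyword(const void *key, const void *element)
{
    const struct keyword *kw = element;
    return strcmp((const char *) key, kw->name);
}

/* Classify text already matched by a catch-all rule such as [a-z]+. */
static int classify_identifier(const char *text)
{
    const struct keyword *hit =
        bsearch(text, keywords, sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);
    return hit ? hit->token : TOK_ID;
}
```

With this in place the rule set shrinks to the single identifier
pattern, which is the case in which the text recommends `-F'.
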
`-I'
     instructs `flex' to generate an *interactive* scanner. An
     interactive scanner is one that only looks ahead to decide what
     token has been matched if it absolutely must. It turns out that
     always looking one extra character ahead, even if the scanner has
     already seen enough text to disambiguate the current token, is a
     bit faster than only looking ahead when necessary. But scanners
     that always look ahead give dreadful interactive performance; for
     example, when a user types a newline, it is not recognized as a
     newline token until they enter *another* token, which often means
     typing in another whole line.

     `Flex' scanners default to *interactive* unless you use the `-Cf'
     or `-CF' table-compression options (see below). That's because if
     you're looking for high-performance you should be using one of
     these options, so if you didn't, `flex' assumes you'd rather trade
     off a bit of run-time performance for intuitive interactive
     behavior. Note also that you *cannot* use `-I' in conjunction
     with `-Cf' or `-CF'. Thus, this option is not really needed; it
     is on by default for all those cases in which it is allowed.

     You can force a scanner to *not* be interactive by using `-B' (see
     above).

`-L'
     instructs `flex' not to generate `#line' directives. Without this
     option, `flex' peppers the generated scanner with `#line'
     directives so error messages in the actions will be correctly
     located with respect to either the original `flex' input file (if
     the errors are due to code in the input file), or `lex.yy.c' (if
     the errors are `flex's' fault - you should report these sorts of
     errors to the email address given below).

`-T'
     makes `flex' run in `trace' mode. It will generate a lot of
     messages to `stderr' concerning the form of the input and the
     resultant non-deterministic and deterministic finite automata.
     This option is mostly for use in maintaining `flex'.

`-V'
     prints the version number to `stdout' and exits. `--version' is a
     synonym for `-V'.

`-7'
     instructs `flex' to generate a 7-bit scanner, i.e., one which can
     only recognize 7-bit characters in its input. The advantage of
     using `-7' is that the scanner's tables can be up to half the size
     of those generated using the `-8' option (see below). The
     disadvantage is that such scanners often hang or crash if their
     input contains an 8-bit character.

     Note, however, that unless you generate your scanner using the
     `-Cf' or `-CF' table compression options, use of `-7' will save
     only a small amount of table space, and make your scanner
     considerably less portable. `Flex's' default behavior is to
     generate an 8-bit scanner unless you use `-Cf' or `-CF', in
     which case `flex' defaults to generating 7-bit scanners unless
     your site was always configured to generate 8-bit scanners (as
     will often be the case with non-USA sites). You can tell whether
     flex generated a 7-bit or an 8-bit scanner by inspecting the flag
     summary in the `-v' output as described above.

     Note that if you use `-Cfe' or `-CFe' (those table compression
     options, but also using equivalence classes as discussed
     below), flex still defaults to generating an 8-bit scanner, since
     usually with these compression options full 8-bit tables are not
     much more expensive than 7-bit tables.

`-8'
     instructs `flex' to generate an 8-bit scanner, i.e., one which can
     recognize 8-bit characters. This flag is only needed for scanners
     generated using `-Cf' or `-CF', as otherwise flex defaults to
     generating an 8-bit scanner anyway.

     See the discussion of `-7' above for flex's default behavior and
     the tradeoffs between 7-bit and 8-bit scanners.

`-+'
     specifies that you want flex to generate a C++ scanner class. See
     the section on Generating C++ Scanners below for details.

`-C[aefFmr]'
     controls the degree of table compression and, more generally,
     trade-offs between small scanners and fast scanners.

     `-Ca' ("align") instructs flex to trade off larger tables in the
     generated scanner for faster performance because the elements of
     the tables are better aligned for memory access and computation.
     On some RISC architectures, fetching and manipulating long-words
     is more efficient than with smaller-sized units such as
     shortwords. This option can double the size of the tables used by
     your scanner.

     `-Ce' directs `flex' to construct "equivalence classes", i.e.,
     sets of characters which have identical lexical properties (for
     example, if the only appearance of digits in the `flex' input is
     in the character class "[0-9]" then the digits '0', '1', ..., '9'
     will all be put in the same equivalence class). Equivalence
     classes usually give dramatic reductions in the final table/object
     file sizes (typically a factor of 2-5) and are pretty cheap
     performance-wise (one array look-up per character scanned).

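The idea behind equivalence classes can be illustrated with a small C
routine. It is a simplified sketch, not the algorithm `flex' itself
uses: `build_equivalence_classes()' is a hypothetical name, the
character classes are passed in as plain strings, and two characters
are grouped together when they belong to exactly the same classes.
The resulting `class_of[]' array is the "one array look-up per
character scanned" mentioned above.

```c
#define NCHARS 256

/* Partition the character set into equivalence classes.
 *
 * sets[i] is a NUL-terminated list of the characters in the i-th
 * character class used by the patterns (at most 32 sets).  Two
 * characters land in the same equivalence class when they belong to
 * exactly the same sets.  class_of[c] receives a class number starting
 * at 1; the return value is the number of classes. */
static int build_equivalence_classes(const char *const sets[], int nsets,
                                     int class_of[NCHARS])
{
    unsigned long signature[NCHARS] = { 0 };
    int nclasses = 0;

    /* Bit i of signature[c] records that character c occurs in set i. */
    for (int i = 0; i < nsets; ++i)
        for (const char *p = sets[i]; *p != '\0'; ++p)
            signature[(unsigned char) *p] |= 1UL << i;

    for (int c = 0; c < NCHARS; ++c) {
        class_of[c] = 0;
        /* Reuse the class of an earlier character with the same signature. */
        for (int prev = 0; prev < c; ++prev) {
            if (signature[prev] == signature[c]) {
                class_of[c] = class_of[prev];
                break;
            }
        }
        if (class_of[c] == 0)
            class_of[c] = ++nclasses;     /* first character of a new class */
    }
    return nclasses;
}
```

For patterns that use only "[0-9]" and "[a-f]", this yields three
classes (digits, the letters a-f, everything else), so a transition
table indexed by class needs three columns per state instead of one
per character.
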
     `-Cf' specifies that the *full* scanner tables should be generated
     - `flex' should not compress the tables by taking advantage of
     similar transition functions for different states.

     `-CF' specifies that the alternate fast scanner representation
     (described above under the `-F' flag) should be used. This option
     cannot be used with `-+'.

     `-Cm' directs `flex' to construct "meta-equivalence classes",
     which are sets of equivalence classes (or characters, if
     equivalence classes are not being used) that are commonly used
     together. Meta-equivalence classes are often a big win when using
     compressed tables, but they have a moderate performance impact
     (one or two "if" tests and one array look-up per character
     scanned).

     `-Cr' causes the generated scanner to *bypass* use of the standard
     I/O library (stdio) for input. Instead of calling `fread()' or
     `getc()', the scanner will use the `read()' system call, resulting
     in a performance gain which varies from system to system, but in
     general is probably negligible unless you are also using `-Cf' or
     `-CF'. Using `-Cr' can cause strange behavior if, for example,
     you read from `yyin' using stdio prior to calling the scanner
     (because the scanner will miss whatever text your previous reads
     left in the stdio input buffer).

     `-Cr' has no effect if you define `YY_INPUT' (see The Generated
     Scanner above).

     A lone `-C' specifies that the scanner tables should be compressed
     but neither equivalence classes nor meta-equivalence classes
     should be used.

     The options `-Cf' or `-CF' and `-Cm' do not make sense together -
     there is no opportunity for meta-equivalence classes if the table
     is not being compressed. Otherwise the options may be freely
     mixed, and are cumulative.

     The default setting is `-Cem', which specifies that `flex' should
     generate equivalence classes and meta-equivalence classes. This
     setting provides the highest degree of table compression. You can
     trade off faster-executing scanners at the cost of larger tables
     with the following generally being true:

          slowest & smallest
                -Cem
                -Cm
                -Ce
                -C
                -C{f,F}e
                -C{f,F}
                -C{f,F}a
          fastest & largest

     Note that scanners with the smallest tables are usually generated
     and compiled the quickest, so during development you will usually
     want to use the default, maximal compression.

     `-Cfe' is often a good compromise between speed and size for
     production scanners.

`-ooutput'
     directs flex to write the scanner to the file `output' instead
     of `lex.yy.c'. If you combine `-o' with the `-t' option, then the
     scanner is written to `stdout' but its `#line' directives (see the
     `-L' option above) refer to the file `output'.

`-Pprefix'
     changes the default `yy' prefix used by `flex' for all
     globally-visible variable and function names to instead be PREFIX.
     For example, `-Pfoo' changes the name of `yytext' to `footext'.
     It also changes the name of the default output file from
     `lex.yy.c' to `lex.foo.c'. Here are all of the names affected:

          yy_create_buffer
          yy_delete_buffer
          yy_flex_debug
          yy_init_buffer
          yy_flush_buffer
          yy_load_buffer_state
          yy_switch_to_buffer
          yyin
          yyleng
          yylex
          yylineno
          yyout
          yyrestart
          yytext
          yywrap

     (If you are using a C++ scanner, then only `yywrap' and
     `yyFlexLexer' are affected.) Within your scanner itself, you can
     still refer to the global variables and functions using either
     version of their name; but externally, they have the modified name.

     This option lets you easily link together multiple `flex' programs
     into the same executable. Note, though, that using this option
     also renames `yywrap()', so you now *must* either provide your own
     (appropriately-named) version of the routine for your scanner, or
     use `%option noyywrap', as linking with `-lfl' no longer provides
     one for you by default.

`-Sskeleton_file'
     overrides the default skeleton file from which `flex' constructs
     its scanners. You'll never need this option unless you are doing
     `flex' maintenance or development.

`flex' also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including `%option' directives in the first section of
the scanner specification. You can specify multiple options with a
single `%option' directive, and multiple directives in the first
section of your flex input file. Most options are given simply as
names, optionally preceded by the word "no" (with no intervening
whitespace) to negate their meaning. A number are equivalent to flex
flags or their negation:

     7bit            -7 option
     8bit            -8 option
     align           -Ca option
     backup          -b option
     batch           -B option
     c++             -+ option

     caseful or
     case-sensitive  opposite of -i (default)

     case-insensitive or
     caseless        -i option

     debug           -d option
     default         opposite of -s option
     ecs             -Ce option
     fast            -F option
     full            -f option
     interactive     -I option
     lex-compat      -l option
     meta-ecs        -Cm option
     perf-report     -p option
     read            -Cr option
     stdout          -t option
     verbose         -v option
     warn            opposite of -w option
                     (use "%option nowarn" for -w)

     array           equivalent to "%array"
     pointer         equivalent to "%pointer" (default)

Some `%option's' provide features otherwise not available:

`always-interactive'
     instructs flex to generate a scanner which always considers its
     input "interactive". Normally, on each new input file the scanner
     calls `isatty()' in an attempt to determine whether the scanner's
     input source is interactive and thus should be read a character at
     a time. When this option is used, however, no such call is
     made.

`main'
     directs flex to provide a default `main()' program for the
     scanner, which simply calls `yylex()'. This option implies
     `noyywrap' (see below).

`never-interactive'
     instructs flex to generate a scanner which never considers its
     input "interactive" (again, no call is made to `isatty()'). This
     is the opposite of `always-interactive'.

`stack'
     enables the use of start condition stacks (see Start Conditions
     above).

`stdinit'
     if unset (i.e., `%option nostdinit') initializes `yyin' and
     `yyout' to nil `FILE' pointers, instead of `stdin' and `stdout'.

`yylineno'
     directs `flex' to generate a scanner that maintains the number of
     the current line read from its input in the global variable
     `yylineno'. This option is implied by `%option lex-compat'.

`yywrap'
     if unset (i.e., `%option noyywrap'), makes the scanner not call
     `yywrap()' upon an end-of-file, but simply assume that there are
     no more files to scan (until the user points `yyin' at a new file
     and calls `yylex()' again).

`flex' scans your rule actions to determine whether you use the
`REJECT' or `yymore()' features. The `reject' and `yymore' options are
available to override its decision as to whether you use these features,
either by setting them (e.g., `%option reject') to indicate the feature
is indeed used, or unsetting them to indicate it actually is not used
(e.g., `%option noyymore').

Three options take string-delimited values, offset with '=':

     %option outfile="ABC"

is equivalent to `-oABC', and

     %option prefix="XYZ"

is equivalent to `-PXYZ'.

Finally,

     %option yyclass="foo"

only applies when generating a C++ scanner (`-+' option). It informs
`flex' that you have derived `foo' as a subclass of `yyFlexLexer' so
`flex' will place your actions in the member function `foo::yylex()'
instead of `yyFlexLexer::yylex()'. It also generates a
`yyFlexLexer::yylex()' member function that emits a run-time error (by
invoking `yyFlexLexer::LexerError()') if called. See Generating C++
Scanners, below, for additional information.

A number of options are available for lint purists who want to
suppress the appearance of unneeded routines in the generated scanner.
Each of the following, if unset, results in the corresponding routine
not appearing in the generated scanner:

     input, unput
     yy_push_state, yy_pop_state, yy_top_state
     yy_scan_buffer, yy_scan_bytes, yy_scan_string

(though `yy_push_state()' and friends won't appear anyway unless you
use `%option stack').


File: flex.info, Node: Performance, Next: C++, Prev: Options, Up: Top

Performance considerations
==========================

The main design goal of `flex' is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table
compression `-C' options outlined above, there are a number of
options/actions which degrade performance. These are, from most
expensive to least:

     REJECT
     %option yylineno
     arbitrary trailing context

     pattern sets that require backing up
     %array
     %option interactive
     %option always-interactive

     '^' beginning-of-line operator
     yymore()

with the first three all being quite expensive and the last two
being quite cheap. Note also that `unput()' is implemented as a
routine call that potentially does quite a bit of work, while
`yyless()' is a quite-cheap macro; so if just putting back some excess
text you scanned, use `yyless()'.

`REJECT' should be avoided at all costs when performance is
important. It is a particularly expensive option.

Getting rid of backing up is messy and often may be an enormous
amount of work for a complicated scanner. In principle, one begins by
using the `-b' flag to generate a `lex.backup' file. For example, on
the input

     %%
     foo        return TOK_KEYWORD;
     foobar     return TOK_KEYWORD;

the file looks like:

     State #6 is non-accepting -
      associated rule line numbers:
            2       3
      out-transitions: [ o ]
      jam-transitions: EOF [ \001-n p-\177 ]

     State #8 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ a ]
      jam-transitions: EOF [ \001-` b-\177 ]

     State #9 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ r ]
      jam-transitions: EOF [ \001-q s-\177 ]

     Compressed tables always back up.

The first few lines tell us that there's a scanner state in which it
can make a transition on an 'o' but not on any other character, and
that in that state the currently scanned text does not match any rule.
The state occurs when trying to match the rules found at lines 2 and 3
in the input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of head-scratching one can see that this
must be the state it's in when it has seen "fo". When this has
happened, if anything other than another 'o' is seen, the scanner will
have to back up to simply match the 'f' (by the default rule).

The comment regarding State #8 indicates there's a problem when
"foob" has been scanned. Indeed, on any character other than an 'a',
the scanner will have to back up to accept "foo". Similarly, the
comment for State #9 concerns when "fooba" has been scanned and an 'r'
does not follow.

The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using `-Cf'
or `-CF', since there's no performance gain doing so with compressed
scanners.

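The backing up reported for "foo" and "foobar" can be imitated with a
small longest-match routine in C. This is a sketch only, not the code
`flex' generates: `scan_literals()' and `struct match' are hypothetical
names, the patterns are plain literal strings rather than regular
expressions, and the default rule is modelled by always accepting a
single character.

```c
#include <string.h>

/* Result of one scanning step. */
struct match {
    int length;     /* length of the token finally accepted */
    int backed_up;  /* characters read past that token and given back */
};

/* Scan `input' against literal patterns the way a longest-match scanner
 * would: keep reading while some pattern could still match, remember the
 * last position where a rule matched, and back up to it when stuck. */
static struct match scan_literals(const char *input,
                                  const char *const patterns[], int npatterns)
{
    struct match result = { 0, 0 };
    int pos = 0;    /* number of input characters read so far */

    for (;;) {
        int viable = 0;   /* can some pattern consume input[pos] too? */

        if (pos == 1)
            result.length = 1;   /* default rule: any single character */

        for (int i = 0; i < npatterns; ++i) {
            size_t len = strlen(patterns[i]);
            if (len >= (size_t) pos &&
                strncmp(patterns[i], input, (size_t) pos) == 0) {
                if (len == (size_t) pos)
                    result.length = pos;      /* accepting position */
                else if (input[pos] != '\0' && patterns[i][pos] == input[pos])
                    viable = 1;
            }
        }
        if (!viable)
            break;
        ++pos;
    }
    result.backed_up = pos - result.length;
    return result;
}
```

With only "foo" and "foobar" as patterns, the input "foobax" is
accepted as "foo" after giving back "ba", and "fox" falls back to the
single 'f'. Adding patterns for the intermediate prefixes leaves
nothing to give back, because every position the scanner can reach is
then an accepting one.
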
The way to remove the backing up is to add "error" rules:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     fooba       |
     foob        |
     fo          {
                 /* false alarm, not really a keyword */
                 return TOK_ID;
                 }

Eliminating backing up among a list of keywords can also be done
using a "catch-all" rule:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     [a-z]+      return TOK_ID;

This is usually the best solution when appropriate.

Backing up messages tend to cascade. With a complicated set of
rules it's not uncommon to get hundreds of messages. If one can
decipher them, though, it often only takes a dozen or so rules to
eliminate the backing up (though it's easy to make a mistake and have
an error rule accidentally match a valid token. A possible future
`flex' feature will be to automatically add rules to eliminate backing
up).

It's important to keep in mind that you gain the benefits of
eliminating backing up only if you eliminate *every* instance of
backing up. Leaving just one means you gain nothing.

VARIABLE trailing context (where both the leading and trailing parts
do not have a fixed length) entails almost the same performance loss as
`REJECT' (i.e., substantial). So when possible a rule like:

     %%
     mouse|rat/(cat|dog)   run();

is better written:

     %%
     mouse/cat|dog         run();
     rat/cat|dog           run();

or as

     %%
     mouse|rat/cat         run();
     mouse|rat/dog         run();

Note that here the special '|' action does *not* provide any
savings, and can even make things worse (see Deficiencies / Bugs below).

Another area where the user can increase a scanner's performance
(and one that's easier to implement) arises from the fact that the
longer the tokens matched, the faster the scanner will run. This is
because with long tokens the processing of most input characters takes
place in the (short) inner scanning loop, and does not often have to go
through the additional work of setting up the scanning environment
(e.g., `yytext') for the action. Recall the scanner for C comments:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>"*"+[^*/\n]*
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

This could be sped up by writing it as:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>[^*\n]*\n      ++line_num;
     <comment>"*"+[^*/\n]*
     <comment>"*"+[^*/\n]*\n ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

Now instead of each newline requiring the processing of another
action, recognizing the newlines is "distributed" over the other rules
to keep the matched text as long as possible. Note that *adding* rules
does *not* slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as '*' and '|'.

A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line and
with no other extraneous characters, and recognize all the keywords. A
natural first approach is:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     .|\n     /* it's not a keyword */

To eliminate the back-tracking, introduce a catch-all rule:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     [a-z]+   |
     .|\n     /* it's not a keyword */

Now, if it's guaranteed that there's exactly one word per line, then
we can reduce the total number of matches by a half by merging in the
recognition of newlines with that of the other tokens:

     %%
     asm\n      |
     auto\n     |
     break\n    |
     ... etc ...
     volatile\n |
     while\n    /* it's a keyword */

     [a-z]+\n   |
     .|\n       /* it's not a keyword */

One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while *we* know that there will never
be any characters in the input stream other than letters or newlines,
`flex' can't figure this out, and it will plan for possibly needing to
back up when it has scanned a token like "auto" and then the next
character is something other than a newline or a letter. Previously it
would then just match the "auto" rule and be done, but now it has no
"auto" rule, only an "auto\n" rule. To eliminate the possibility of
backing up, we could either duplicate all rules but without final
newlines, or, since we never expect to encounter such an input and
therefore don't care how it's classified, we can introduce one more
catch-all rule, this one which doesn't include a newline:

     %%
     asm\n      |
     auto\n     |
     break\n    |
     ... etc ...
     volatile\n |
     while\n    /* it's a keyword */

     [a-z]+\n   |
     [a-z]+     |
     .|\n       /* it's not a keyword */

Compiled with `-Cf', this is about as fast as one can get a `flex'
scanner to go for this particular problem.

A final note: `flex' is slow when matching NUL's, particularly when
a token contains multiple NUL's. It's best to write rules which match
*short* amounts of text if it's anticipated that the text will often
include NUL's.

Another final note regarding performance: as mentioned above in the
section How the Input is Matched, dynamically resizing `yytext' to
accommodate huge tokens is a slow process because it presently requires
that the (huge) token be rescanned from the beginning. Thus if
performance is vital, you should attempt to match "large" quantities of
text but not "huge" quantities, where the cutoff between the two is at
about 8K characters/token.


File: flex.info, Node: C++, Next: Incompatibilities, Prev: Performance, Up: Top
|
|||
|
|
|||
|
Generating C++ scanners
|
|||
|
=======================
|
|||
|
|
|||
|
`flex' provides two different ways to generate scanners for use with
|
|||
|
C++. The first way is to simply compile a scanner generated by `flex'
|
|||
|
using a C++ compiler instead of a C compiler. You should not encounter
|
|||
|
any compilation errors (please report any you find to the email address
|
|||
|
given in the Author section below). You can then use C++ code in your
|
|||
|
rule actions instead of C code. Note that the default input source for
|
|||
|
your scanner remains `yyin', and default echoing is still done to
|
|||
|
`yyout'. Both of these remain `FILE *' variables and not C++ `streams'.
|
|||
|
|
|||
|
You can also use `flex' to generate a C++ scanner class, using the
|
|||
|
`-+' option (or, equivalently, `%option c++'), which is automatically
|
|||
|
specified if the name of the flex executable ends in a `+', such as
|
|||
|
`flex++'. When using this option, flex defaults to generating the
|
|||
|
scanner to the file `lex.yy.cc' instead of `lex.yy.c'. The generated
|
|||
|
scanner includes the header file `FlexLexer.h', which defines the
|
|||
|
interface to two C++ classes.
|
|||
|
|
|||
|
The first class, `FlexLexer', provides an abstract base class
|
|||
|
defining the general scanner class interface. It provides the
|
|||
|
following member functions:
|
|||
|
|
|||
|
`const char* YYText()'
|
|||
|
returns the text of the most recently matched token, the
|
|||
|
equivalent of `yytext'.
|
|||
|
|
|||
|
`int YYLeng()'
|
|||
|
returns the length of the most recently matched token, the
|
|||
|
equivalent of `yyleng'.
|
|||
|
|
|||
|
`int lineno() const'
|
|||
|
returns the current input line number (see `%option yylineno'), or
|
|||
|
1 if `%option yylineno' was not used.
|
|||
|
|
|||
|
`void set_debug( int flag )'
|
|||
|
sets the debugging flag for the scanner, equivalent to assigning to
|
|||
|
`yy_flex_debug' (see the Options section above). Note that you
|
|||
|
must build the scanner using `%option debug' to include debugging
|
|||
|
information in it.
|
|||
|
|
|||
|
`int debug() const'
|
|||
|
returns the current setting of the debugging flag.
|
|||
|
|
|||
|
Also provided are member functions equivalent to
|
|||
|
`yy_switch_to_buffer()', `yy_create_buffer()' (though the first argument
|
|||
|
is an `istream*' object pointer and not a `FILE*'), `yy_flush_buffer()',
|
|||
|
`yy_delete_buffer()', and `yyrestart()' (again, the first argument is a
|
|||
|
`istream*' object pointer).
|
|||
|
|
|||
|
The second class defined in `FlexLexer.h' is `yyFlexLexer', which is
|
|||
|
derived from `FlexLexer'. It defines the following additional member
|
|||
|
functions:
|
|||
|
|
|||
|
`yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )'
|
|||
|
constructs a `yyFlexLexer' object using the given streams for
|
|||
|
input and output. If not specified, the streams default to `cin'
|
|||
|
and `cout', respectively.
|
|||
|
|
|||
|
`virtual int yylex()'
|
|||
|
performs the same role as `yylex()' does for ordinary flex
|
|||
|
scanners: it scans the input stream, consuming tokens, until a
|
|||
|
rule's action returns a value. If you derive a subclass S from
|
|||
|
`yyFlexLexer' and want to access the member functions and
|
|||
|
variables of S inside `yylex()', then you need to use `%option
|
|||
|
yyclass="S"' to inform `flex' that you will be using that subclass
|
|||
|
instead of `yyFlexLexer'. In this case, rather than generating
|
|||
|
`yyFlexLexer::yylex()', `flex' generates `S::yylex()' (and also
|
|||
|
generates a dummy `yyFlexLexer::yylex()' that calls
|
|||
|
`yyFlexLexer::LexerError()' if called).
|
|||
|
|
|||
|
`virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)'
|
|||
|
reassigns `yyin' to `new_in' (if non-nil) and `yyout' to `new_out'
|
|||
|
(ditto), deleting the previous input buffer if `yyin' is
|
|||
|
reassigned.
|
|||
|
|
|||
|
`int yylex( istream* new_in = 0, ostream* new_out = 0 )'
|
|||
|
first switches the input streams via `switch_streams( new_in,
|
|||
|
new_out )' and then returns the value of `yylex()'.
|
|||
|
|
|||
|
In addition, `yyFlexLexer' defines the following protected virtual
|
|||
|
functions which you can redefine in derived classes to tailor the
|
|||
|
scanner:
|
|||
|
|
|||
|
`virtual int LexerInput( char* buf, int max_size )'
|
|||
|
reads up to `max_size' characters into BUF and returns the number
|
|||
|
of characters read. To indicate end-of-input, return 0
|
|||
|
characters. Note that "interactive" scanners (see the `-B' and
|
|||
|
`-I' flags) define the macro `YY_INTERACTIVE'. If you redefine
|
|||
|
`LexerInput()' and need to take different actions depending on
|
|||
|
whether or not the scanner might be scanning an interactive input
|
|||
|
source, you can test for the presence of this name via `#ifdef'.
|
|||
|
|
|||
|
`virtual void LexerOutput( const char* buf, int size )'
|
|||
|
writes out SIZE characters from the buffer BUF, which, while
|
|||
|
NUL-terminated, may also contain "internal" NUL's if the scanner's
|
|||
|
rules can match text with NUL's in them.
|
|||
|
|
|||
|
`virtual void LexerError( const char* msg )'
|
|||
|
reports a fatal error message. The default version of this
|
|||
|
function writes the message to the stream `cerr' and exits.
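The contracts of `LexerInput()' and `LexerOutput()' described above can be sketched without the real `FlexLexer.h'. The following is only an illustration under stated assumptions: `StringSource', `write_token' and `demo_output' are hypothetical names, not part of flex. They show the behavior an override would need: return the number of characters copied (0 at end-of-input), and write exactly SIZE characters so that internal NUL's survive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper (not part of FlexLexer.h). Its read() follows the
// contract the text gives for LexerInput(), but takes its characters from
// an in-memory string instead of a stream.
class StringSource {
public:
    explicit StringSource(std::string text) : text_(std::move(text)), pos_(0) {}

    // Copies up to max_size characters into buf and returns how many were
    // copied. A return value of 0 signals end-of-input.
    int read(char* buf, int max_size) {
        assert(buf != nullptr);
        if (max_size <= 0) return 0;
        const std::size_t remaining = text_.size() - pos_;
        const std::size_t n =
            std::min(static_cast<std::size_t>(max_size), remaining);
        std::memcpy(buf, text_.data() + pos_, n);
        pos_ += n;
        return static_cast<int>(n);
    }

private:
    std::string text_;
    std::size_t pos_;
};

// Follows the contract given for LexerOutput(): exactly `size` characters
// are written, so NUL's inside the token are not treated as terminators.
void write_token(std::ostream& out, const char* buf, int size) {
    out.write(buf, size);
}

// Writes the three characters 'a', NUL, 'b' and returns what was written.
std::string demo_output() {
    const char token[] = {'a', '\0', 'b', '\0'};
    std::ostringstream out;
    write_token(out, token, 3);
    return out.str();
}
```

In a real scanner, a class derived from `yyFlexLexer' would presumably forward its `LexerInput()' override to something like `StringSource::read()'. Writing the buffer with `out << buf' instead of `out.write(buf, size)' would stop at the first NUL.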
|
|||
|
|
|||
|
Note that a `yyFlexLexer' object contains its *entire* scanning
|
|||
|
state. Thus you can use such objects to create reentrant scanners.
|
|||
|
You can instantiate multiple instances of the same `yyFlexLexer' class,
|
|||
|
and you can also combine multiple C++ scanner classes together in the
|
|||
|
same program using the `-P' option discussed above. Finally, note that
|
|||
|
the `%array' feature is not available to C++ scanner classes; you must
|
|||
|
use `%pointer' (the default).
|
|||
|
|
|||
|
Here is an example of a simple C++ scanner:
|
|||
|
|
|||
|
// An example of using the flex C++ scanner class.
|
|||
|
|
|||
|
%{
|
|||
|
int mylineno = 0;
|
|||
|
%}
|
|||
|
|
|||
|
string \"[^\n"]+\"
|
|||
|
|
|||
|
ws [ \t]+
|
|||
|
|
|||
|
alpha [A-Za-z]
|
|||
|
dig [0-9]
|
|||
|
name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
|
|||
|
num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)?
|
|||
|
num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
|
|||
|
number {num1}|{num2}
|
|||
|
|
|||
|
%%
|
|||
|
|
|||
|
{ws} /* skip blanks and tabs */
|
|||
|
|
|||
|
"/*" {
|
|||
|
int c;
|
|||
|
|
|||
|
while((c = yyinput()) != 0)
|
|||
|
{
|
|||
|
if(c == '\n')
|
|||
|
++mylineno;
|
|||
|
|
|||
|
else if(c == '*')
|
|||
|
{
|
|||
|
if((c = yyinput()) == '/')
|
|||
|
break;
|
|||
|
else
|
|||
|
unput(c);
|
|||
|
}
|
|||
|
}
|
|||
|
}
|
|||
|
|
|||
|
{number} cout << "number " << YYText() << '\n';
|
|||
|
|
|||
|
\n mylineno++;
|
|||
|
|
|||
|
{name} cout << "name " << YYText() << '\n';
|
|||
|
|
|||
|
{string} cout << "string " << YYText() << '\n';
|
|||
|
|
|||
|
%%
|
|||
|
|
|||
|
|
|||
|
int main( int /* argc */, char** /* argv */ )
|
|||
|
{
|
|||
|
FlexLexer* lexer = new yyFlexLexer;
|
|||
|
while(lexer->yylex() != 0)
|
|||
|
;
|
|||
|
return 0;
|
|||
|
}
|
|||
|
|
|||
|
If you want to create multiple (different) lexer classes, you use
|
|||
|
the `-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to
|
|||
|
some other `xxFlexLexer'. You then can include `<FlexLexer.h>' in your
|
|||
|
other sources once per lexer class, first renaming `yyFlexLexer' as
|
|||
|
follows:
|
|||
|
|
|||
|
#undef yyFlexLexer
|
|||
|
#define yyFlexLexer xxFlexLexer
|
|||
|
#include <FlexLexer.h>
|
|||
|
|
|||
|
#undef yyFlexLexer
|
|||
|
#define yyFlexLexer zzFlexLexer
|
|||
|
#include <FlexLexer.h>
|
|||
|
|
|||
|
if, for example, you used `%option prefix="xx"' for one of your
|
|||
|
scanners and `%option prefix="zz"' for the other.
|
|||
|
|
|||
|
IMPORTANT: the present form of the scanning class is *experimental*
|
|||
|
and may change considerably between major releases.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Incompatibilities, Next: Diagnostics, Prev: C++, Up: Top
|
|||
|
|
|||
|
Incompatibilities with `lex' and POSIX
|
|||
|
======================================
|
|||
|
|
|||
|
`flex' is a rewrite of the AT&T Unix `lex' tool (the two
|
|||
|
implementations do not share any code, though), with some extensions
|
|||
|
and incompatibilities, both of which are of concern to those who wish
|
|||
|
to write scanners acceptable to either implementation. Flex is fully
|
|||
|
compliant with the POSIX `lex' specification, except that when using
|
|||
|
`%pointer' (the default), a call to `unput()' destroys the contents of
|
|||
|
`yytext', which is counter to the POSIX specification.
|
|||
|
|
|||
|
In this section we discuss all of the known areas of incompatibility
|
|||
|
between flex, AT&T lex, and the POSIX specification.
|
|||
|
|
|||
|
`flex's' `-l' option turns on maximum compatibility with the
|
|||
|
original AT&T `lex' implementation, at the cost of a major loss in the
|
|||
|
generated scanner's performance. We note below which incompatibilities
|
|||
|
can be overcome using the `-l' option.
|
|||
|
|
|||
|
`flex' is fully compatible with `lex' with the following exceptions:
|
|||
|
|
|||
|
- The undocumented `lex' scanner internal variable `yylineno' is not
|
|||
|
supported unless `-l' or `%option yylineno' is used. `yylineno'
|
|||
|
should be maintained on a per-buffer basis, rather than a
|
|||
|
per-scanner (single global variable) basis. `yylineno' is not
|
|||
|
part of the POSIX specification.
|
|||
|
|
|||
|
- The `input()' routine is not redefinable, though it may be called
|
|||
|
to read characters following whatever has been matched by a rule.
|
|||
|
If `input()' encounters an end-of-file the normal `yywrap()'
|
|||
|
processing is done. A "real" end-of-file is returned by `input()'
|
|||
|
as `EOF'.
|
|||
|
|
|||
|
Input is instead controlled by defining the `YY_INPUT' macro.
|
|||
|
|
|||
|
The `flex' restriction that `input()' cannot be redefined is in
|
|||
|
accordance with the POSIX specification, which simply does not
|
|||
|
specify any way of controlling the scanner's input other than by
|
|||
|
making an initial assignment to `yyin'.
|
|||
|
|
|||
|
- The `unput()' routine is not redefinable. This restriction is in
|
|||
|
accordance with POSIX.
|
|||
|
|
|||
|
- `flex' scanners are not as reentrant as `lex' scanners. In
|
|||
|
particular, if you have an interactive scanner and an interrupt
|
|||
|
handler which long-jumps out of the scanner, and the scanner is
|
|||
|
subsequently called again, you may get the following message:
|
|||
|
|
|||
|
fatal flex scanner internal error--end of buffer missed
|
|||
|
|
|||
|
To reenter the scanner, first use
|
|||
|
|
|||
|
yyrestart( yyin );
|
|||
|
|
|||
|
Note that this call will throw away any buffered input; usually
|
|||
|
this isn't a problem with an interactive scanner.
|
|||
|
|
|||
|
Also note that flex C++ scanner classes *are* reentrant, so if
|
|||
|
using C++ is an option for you, you should use them instead. See
|
|||
|
"Generating C++ Scanners" above for details.
|
|||
|
|
|||
|
- `output()' is not supported. Output from the `ECHO' macro is done
|
|||
|
to the file-pointer `yyout' (default `stdout').
|
|||
|
|
|||
|
`output()' is not part of the POSIX specification.
|
|||
|
|
|||
|
- `lex' does not support exclusive start conditions (%x), though
|
|||
|
they are in the POSIX specification.
|
|||
|
|
|||
|
- When definitions are expanded, `flex' encloses them in
|
|||
|
parentheses. With lex, the following:
|
|||
|
|
|||
|
NAME [A-Z][A-Z0-9]*
|
|||
|
%%
|
|||
|
foo{NAME}? printf( "Found it\n" );
|
|||
|
%%
|
|||
|
|
|||
|
will not match the string "foo" because when the macro is expanded
|
|||
|
the rule is equivalent to "foo[A-Z][A-Z0-9]*?" and the precedence
|
|||
|
is such that the '?' is associated with "[A-Z0-9]*". With `flex',
|
|||
|
the rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the
|
|||
|
string "foo" will match.
|
|||
|
|
|||
|
Note that if the definition begins with `^' or ends with `$' then
|
|||
|
it is *not* expanded with parentheses, to allow these operators to
|
|||
|
appear in definitions without losing their special meanings. But
|
|||
|
the `<s>', `/', and `<<EOF>>' operators cannot be used in a `flex'
|
|||
|
definition.
|
|||
|
|
|||
|
Using `-l' results in the `lex' behavior of no parentheses around
|
|||
|
the definition.
|
|||
|
|
|||
|
The POSIX specification is that the definition be enclosed in
|
|||
|
parentheses.
|
|||
|
|
|||
|
- Some implementations of `lex' allow a rule's action to begin on a
|
|||
|
separate line, if the rule's pattern has trailing whitespace:
|
|||
|
|
|||
|
%%
|
|||
|
foo|bar<space here>
|
|||
|
{ foobar_action(); }
|
|||
|
|
|||
|
`flex' does not support this feature.
|
|||
|
|
|||
|
- The `lex' `%r' (generate a Ratfor scanner) option is not
|
|||
|
supported. It is not part of the POSIX specification.
|
|||
|
|
|||
|
- After a call to `unput()', `yytext' is undefined until the next
|
|||
|
token is matched, unless the scanner was built using `%array'.
|
|||
|
This is not the case with `lex' or the POSIX specification. The
|
|||
|
`-l' option does away with this incompatibility.
|
|||
|
|
|||
|
- The precedence of the `{}' (numeric range) operator is different.
|
|||
|
`lex' interprets "abc{1,3}" as "match one, two, or three
|
|||
|
occurrences of 'abc'", whereas `flex' interprets it as "match 'ab'
|
|||
|
followed by one, two, or three occurrences of 'c'". The latter is
|
|||
|
in agreement with the POSIX specification.
|
|||
|
|
|||
|
- The precedence of the `^' operator is different. `lex' interprets
|
|||
|
"^foo|bar" as "match either 'foo' at the beginning of a line, or
|
|||
|
'bar' anywhere", whereas `flex' interprets it as "match either
|
|||
|
'foo' or 'bar' if they come at the beginning of a line". The
|
|||
|
latter is in agreement with the POSIX specification.
|
|||
|
|
|||
|
- The special table-size declarations such as `%a' supported by
|
|||
|
`lex' are not required by `flex' scanners; `flex' ignores them.
|
|||
|
|
|||
|
- The name FLEX_SCANNER is #define'd so scanners may be written for
|
|||
|
use with either `flex' or `lex'. Scanners also include
|
|||
|
`YY_FLEX_MAJOR_VERSION' and `YY_FLEX_MINOR_VERSION' indicating
|
|||
|
which version of `flex' generated the scanner (for example, for the
|
|||
|
2.5 release, these defines would be 2 and 5 respectively).
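Two of the precedence differences in the list above (parenthesized definitions and the `{}' operator) can be checked with ordinary regular expressions. This is a sketch only: it uses `std::regex' with its default ECMAScript grammar as a stand-in for the `lex' and `flex' pattern languages, and each interpretation is spelled out with explicit parentheses rather than produced by either tool. The constant and function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole-string match helper. std::regex is only a stand-in here; it is not
// the matcher used by lex or flex.
bool matches(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Rule foo{NAME}? with NAME defined as [A-Z][A-Z0-9]*.
// lex pastes the definition in as text, so '?' binds only to "[A-Z0-9]*".
const std::string kLexExpansion = "foo[A-Z]([A-Z0-9]*)?";
// flex wraps the definition in parentheses, so '?' covers all of it.
const std::string kFlexExpansion = "foo([A-Z][A-Z0-9]*)?";

// Pattern abc{1,3}.
// lex reading: one to three occurrences of "abc".
const std::string kLexRange = "(abc){1,3}";
// flex (and POSIX) reading: "ab" followed by one to three 'c'.
const std::string kFlexRange = "ab(c){1,3}";

// Records the outcomes described in the text as assertions.
void check_precedence_examples() {
    assert(!matches(kLexExpansion, "foo"));   // lex: "foo" alone fails
    assert(matches(kFlexExpansion, "foo"));   // flex: "foo" alone matches
    assert(matches(kLexRange, "abcabc"));
    assert(!matches(kFlexRange, "abcabc"));
    assert(matches(kFlexRange, "abccc"));
    assert(!matches(kLexRange, "abccc"));
}
```

Writing the parentheses explicitly in a scanner specification, as in the constants above, is one way to get the same result from either tool.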
|
|||
|
|
|||
|
The following `flex' features are not included in `lex' or the POSIX
|
|||
|
specification:
|
|||
|
|
|||
|
C++ scanners
|
|||
|
%option
|
|||
|
start condition scopes
|
|||
|
start condition stacks
|
|||
|
interactive/non-interactive scanners
|
|||
|
yy_scan_string() and friends
|
|||
|
yyterminate()
|
|||
|
yy_set_interactive()
|
|||
|
yy_set_bol()
|
|||
|
YY_AT_BOL()
|
|||
|
<<EOF>>
|
|||
|
<*>
|
|||
|
YY_DECL
|
|||
|
YY_START
|
|||
|
YY_USER_ACTION
|
|||
|
YY_USER_INIT
|
|||
|
#line directives
|
|||
|
%{}'s around actions
|
|||
|
multiple actions on a line
|
|||
|
|
|||
|
plus almost all of the flex flags. The last feature in the list refers
|
|||
|
to the fact that with `flex' you can put multiple actions on the same
|
|||
|
line, separated with semicolons, while with `lex', the following
|
|||
|
|
|||
|
foo handle_foo(); ++num_foos_seen;
|
|||
|
|
|||
|
is (rather surprisingly) truncated to
|
|||
|
|
|||
|
foo handle_foo();
|
|||
|
|
|||
|
`flex' does not truncate the action. Actions that are not enclosed
|
|||
|
in braces are simply terminated at the end of the line.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Diagnostics, Next: Files, Prev: Incompatibilities, Up: Top
|
|||
|
|
|||
|
Diagnostics
|
|||
|
===========
|
|||
|
|
|||
|
`warning, rule cannot be matched'
|
|||
|
indicates that the given rule cannot be matched because it follows
|
|||
|
other rules that will always match the same text as it. For
|
|||
|
example, in the following "foo" cannot be matched because it comes
|
|||
|
after an identifier "catch-all" rule:
|
|||
|
|
|||
|
[a-z]+ got_identifier();
|
|||
|
foo got_foo();
|
|||
|
|
|||
|
Using `REJECT' in a scanner suppresses this warning.
|
|||
|
|
|||
|
`warning, -s option given but default rule can be matched'
|
|||
|
means that it is possible (perhaps only in a particular start
|
|||
|
condition) that the default rule (match any single character) is
|
|||
|
the only one that will match a particular input. Since `-s' was
|
|||
|
given, presumably this is not intended.
|
|||
|
|
|||
|
`reject_used_but_not_detected undefined'
|
|||
|
`yymore_used_but_not_detected undefined'
|
|||
|
These errors can occur at compile time. They indicate that the
|
|||
|
scanner uses `REJECT' or `yymore()' but that `flex' failed to
|
|||
|
notice the fact, meaning that `flex' scanned the first two sections
|
|||
|
looking for occurrences of these actions and failed to find any,
|
|||
|
but somehow you snuck some in (via a #include file, for example).
|
|||
|
Use `%option reject' or `%option yymore' to indicate to flex that
|
|||
|
you really do use these features.
|
|||
|
|
|||
|
`flex scanner jammed'
|
|||
|
a scanner compiled with `-s' has encountered an input string which
|
|||
|
wasn't matched by any of its rules. This error can also occur due
|
|||
|
to internal problems.
|
|||
|
|
|||
|
`token too large, exceeds YYLMAX'
|
|||
|
your scanner uses `%array' and one of its rules matched a string
|
|||
|
longer than the `YYLMAX' constant (8K bytes by default). You
|
|||
|
can increase the value by #define'ing `YYLMAX' in the definitions
|
|||
|
section of your `flex' input.
|
|||
|
|
|||
|
`scanner requires -8 flag to use the character 'X''
|
|||
|
Your scanner specification includes recognizing the 8-bit
|
|||
|
character X and you did not specify the -8 flag, and your scanner
|
|||
|
defaulted to 7-bit because you used the `-Cf' or `-CF' table
|
|||
|
compression options. See the discussion of the `-7' flag for
|
|||
|
details.
|
|||
|
|
|||
|
`flex scanner push-back overflow'
|
|||
|
you used `unput()' to push back so much text that the scanner's
|
|||
|
buffer could not hold both the pushed-back text and the current
|
|||
|
token in `yytext'. Ideally the scanner should dynamically resize
|
|||
|
the buffer in this case, but at present it does not.
|
|||
|
|
|||
|
`input buffer overflow, can't enlarge buffer because scanner uses REJECT'
|
|||
|
the scanner was working on matching an extremely large token and
|
|||
|
needed to expand the input buffer. This doesn't work with
|
|||
|
scanners that use `REJECT'.
|
|||
|
|
|||
|
`fatal flex scanner internal error--end of buffer missed'
|
|||
|
This can occur in a scanner which is reentered after a long-jump
|
|||
|
has jumped out (or over) the scanner's activation frame. Before
|
|||
|
reentering the scanner, use:
|
|||
|
|
|||
|
yyrestart( yyin );
|
|||
|
|
|||
|
or, as noted above, switch to using the C++ scanner class.
|
|||
|
|
|||
|
`too many start conditions in <> construct!'
|
|||
|
you listed more start conditions in a <> construct than exist (so
|
|||
|
you must have listed at least one of them twice).
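The first diagnostic above, `rule cannot be matched', follows from how a rule is chosen: the longest match wins, and ties go to the rule listed first. The sketch below simulates that selection with `std::regex'. It is a simplified model, not flex's algorithm: `pick_rule' is a hypothetical helper, and it is only adequate for simple patterns such as the two in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the index of the rule selected at the start of `input`:
// the longest match wins, and on a tie the earlier rule is kept.
// Returns -1 if no rule matches at least one character.
int pick_rule(const std::vector<std::string>& rules, const std::string& input) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const std::regex re(rules[i]);
        std::smatch m;
        // match_continuous anchors the match at the start of the input.
        if (std::regex_search(input, m, re,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {  // strictly longer, so ties keep the earlier rule
                best = static_cast<int>(i);
                best_len = len;
            }
        }
    }
    return best;
}

// With the catch-all rule first, "foo" is never selected.
void check_rule_order() {
    const std::vector<std::string> catch_all_first = {"[a-z]+", "foo"};
    assert(pick_rule(catch_all_first, "foo") == 0);
    assert(pick_rule(catch_all_first, "foobar") == 0);

    const std::vector<std::string> keyword_first = {"foo", "[a-z]+"};
    assert(pick_rule(keyword_first, "foo") == 0);   // now the keyword rule
    assert(pick_rule(keyword_first, "food") == 1);  // longer match wins
}
```

Listing the specific rule before the catch-all, as in `keyword_first', is the usual way to make both rules reachable.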
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Files, Next: Deficiencies, Prev: Diagnostics, Up: Top
|
|||
|
|
|||
|
Files
|
|||
|
=====
|
|||
|
|
|||
|
`-lfl'
|
|||
|
library with which scanners must be linked.
|
|||
|
|
|||
|
`lex.yy.c'
|
|||
|
generated scanner (called `lexyy.c' on some systems).
|
|||
|
|
|||
|
`lex.yy.cc'
|
|||
|
generated C++ scanner class, when using `-+'.
|
|||
|
|
|||
|
`<FlexLexer.h>'
|
|||
|
header file defining the C++ scanner base class, `FlexLexer', and
|
|||
|
its derived class, `yyFlexLexer'.
|
|||
|
|
|||
|
`flex.skl'
|
|||
|
skeleton scanner. This file is only used when building flex, not
|
|||
|
when flex executes.
|
|||
|
|
|||
|
`lex.backup'
|
|||
|
backing-up information for `-b' flag (called `lex.bck' on some
|
|||
|
systems).
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Deficiencies, Next: See also, Prev: Files, Up: Top
|
|||
|
|
|||
|
Deficiencies / Bugs
|
|||
|
===================
|
|||
|
|
|||
|
Some trailing context patterns cannot be properly matched and
|
|||
|
generate warning messages ("dangerous trailing context"). These are
|
|||
|
patterns where the ending of the first part of the rule matches the
|
|||
|
beginning of the second part, such as "zx*/xy*", where the 'x*' matches
|
|||
|
the 'x' at the beginning of the trailing context. (Note that the POSIX
|
|||
|
draft states that the text matched by such patterns is undefined.)
|
|||
|
|
|||
|
For some trailing context rules, parts which are actually
|
|||
|
fixed-length are not recognized as such, leading to the abovementioned
|
|||
|
performance loss. In particular, parts using '|' or {n} (such as
|
|||
|
"foo{3}") are always considered variable-length.
|
|||
|
|
|||
|
Combining trailing context with the special '|' action can result in
|
|||
|
*fixed* trailing context being turned into the more expensive VARIABLE
|
|||
|
trailing context. For example, this conversion occurs in the following:
|
|||
|
|
|||
|
%%
|
|||
|
abc |
|
|||
|
xyz/def
|
|||
|
|
|||
|
Use of `unput()' invalidates yytext and yyleng, unless the `%array'
|
|||
|
directive or the `-l' option has been used.
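Why `unput()' invalidates `yytext' under `%pointer' can be pictured with a toy model. The sketch assumes that the token text lives inside the input buffer and that a pushed-back character is stored over the rightmost character of the token. `ToyScanner' and its members are hypothetical and much simpler than the generated scanner.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model of a %pointer scanner: the token is a window into the buffer,
// not a separate copy.
struct ToyScanner {
    std::string buffer;     // input buffer
    std::size_t tok_begin;  // start of the current token (plays yytext)
    std::size_t tok_len;    // length of the current token (plays yyleng)
    std::size_t cursor;     // next position the scanner will read

    // The token as seen through the window into the buffer.
    std::string token() const { return buffer.substr(tok_begin, tok_len); }

    // Pushes c back so that it is the next character read. The slot it
    // occupies is the last character of the current token.
    void unput(char c) {
        assert(cursor > 0);
        buffer[--cursor] = c;
    }
};

// Copying the token first keeps it usable after unput().
void check_unput_model() {
    ToyScanner s{"hello world", 0, 5, 5};
    const std::string saved = s.token();  // private copy, like %array
    s.unput('!');
    assert(saved == "hello");        // the copy is intact
    assert(s.token() == "hell!");    // the in-buffer text was overwritten
    assert(s.buffer[s.cursor] == '!');
}
```

In a real action the same precaution applies: copy the text of `yytext' before calling `unput()' if the text is still needed afterwards.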
|
|||
|
|
|||
|
Pattern-matching of NUL's is substantially slower than matching
|
|||
|
other characters.
|
|||
|
|
|||
|
Dynamic resizing of the input buffer is slow, as it entails
|
|||
|
rescanning all the text matched so far by the current (generally huge)
|
|||
|
token.
|
|||
|
|
|||
|
Due to both buffering of input and read-ahead, you cannot intermix
|
|||
|
calls to <stdio.h> routines, such as `getchar()', with
|
|||
|
`flex' rules and expect it to work. Call `input()' instead.
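The effect of read-ahead can be reproduced with plain <stdio.h> calls and no scanner at all. In this sketch a four-character `fread()' stands in for the scanner filling its buffer; the function name and the buffer size are assumptions made for illustration and do not reflect flex's actual buffer size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Writes "abcdef" to a temporary file, reads four characters in one block
// (as a buffering scanner might), treats only the first one as consumed,
// and then returns what a direct stdio read sees next.
// Returns EOF if the temporary file cannot be created.
int char_seen_by_stdio_after_readahead() {
    std::FILE* fp = std::tmpfile();
    if (fp == nullptr) return EOF;
    std::fputs("abcdef", fp);
    std::rewind(fp);

    char block[4];
    const std::size_t got = std::fread(block, 1, sizeof block, fp);
    assert(got == sizeof block);
    assert(block[0] == 'a');  // the only character the "scanner" has used

    // A caller expecting 'b' here is surprised: 'b', 'c' and 'd' are
    // already sitting in the scanner's private buffer.
    const int next = std::fgetc(fp);
    std::fclose(fp);
    return next;
}
```

On a system where the temporary file can be created, the function returns 'e' rather than 'b', which is the mismatch the paragraph above warns about.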
|
|||
|
|
|||
|
The total table entries listed by the `-v' flag excludes the number
|
|||
|
of table entries needed to determine what rule has been matched. The
|
|||
|
number of entries is equal to the number of DFA states if the scanner
|
|||
|
does not use `REJECT', and somewhat greater than the number of states
|
|||
|
if it does.
|
|||
|
|
|||
|
`REJECT' cannot be used with the `-f' or `-F' options.
|
|||
|
|
|||
|
The `flex' internal algorithms need documentation.
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: See also, Next: Author, Prev: Deficiencies, Up: Top
|
|||
|
|
|||
|
See also
|
|||
|
========
|
|||
|
|
|||
|
`lex'(1), `yacc'(1), `sed'(1), `awk'(1).
|
|||
|
|
|||
|
John Levine, Tony Mason, and Doug Brown: Lex & Yacc; O'Reilly and
|
|||
|
Associates. Be sure to get the 2nd edition.
|
|||
|
|
|||
|
M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.
|
|||
|
|
|||
|
Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: Principles,
|
|||
|
Techniques and Tools; Addison-Wesley (1986). Describes the
|
|||
|
pattern-matching techniques used by `flex' (deterministic finite
|
|||
|
automata).
|
|||
|
|
|||
|
|
|||
|
File: flex.info, Node: Author, Prev: See also, Up: Top
|
|||
|
|
|||
|
Author
|
|||
|
======
|
|||
|
|
|||
|
Vern Paxson, with the help of many ideas and much inspiration from
|
|||
|
Van Jacobson. Original version by Jef Poskanzer. The fast table
|
|||
|
representation is a partial implementation of a design done by Van
|
|||
|
Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
|
|||
|
|
|||
|
Thanks to the many `flex' beta-testers, feedbackers, and
|
|||
|
contributors, especially Francois Pinard, Casey Leedom, Stan Adermann,
|
|||
|
Terry Allen, David Barker-Plummer, John Basrai, Nelson H.F. Beebe,
|
|||
|
`benson@odi.com', Karl Berry, Peter A. Bigot, Simon Blanchard, Keith
|
|||
|
Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, Brian
|
|||
|
Clapper, J.T. Conklin, Jason Coughlin, Bill Cox, Nick Cropper, Dave
|
|||
|
Curtis, Scott David Daniels, Chris G. Demetriou, Theo Deraadt, Mike
|
|||
|
Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
|
|||
|
Flatters, Jon Forrest, Joe Gayda, Kaveh R. Ghazi, Eric Goldman,
|
|||
|
Christopher M. Gould, Ulrich Grepel, Peer Griebel, Jan Hajic, Charles
|
|||
|
Hemphill, NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
|
|||
|
Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs, Michal
|
|||
|
Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry Juengst, Klaus
|
|||
|
Kaempf, Jonathan I. Kamens, Terrence O Kane, Amir Katz,
|
|||
|
`ken@ken.hilco.com', Kevin B. Kenny, Steve Kirsch, Winfried Koenig,
|
|||
|
Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard, Craig Leres, John
|
|||
|
Levine, Steve Liddle, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
|
|||
|
Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn, Jim
|
|||
|
Meyering, R. Alexander Milowski, Erik Naggum, G.T. Nicol, Landon Noll,
|
|||
|
James Nordby, Marc Nozell, Richard Ohnemus, Karsten Pahnke, Sven Panne,
|
|||
|
Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef
|
|||
|
Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin,
|
|||
|
Rick Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto
|
|||
|
Santini, Andreas Scherer, Darrell Schiebel, Raf Schietekat, Doug
|
|||
|
Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Eckehard
|
|||
|
Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart, Dave Tallman, Ian
|
|||
|
Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi Tsai, Paul
|
|||
|
Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
|
|||
|
Yap, Ron Zellar, Nathan Zelle, David Zuhn, and those whose names have
|
|||
|
slipped my marginal mail-archiving skills but whose contributions are
|
|||
|
appreciated all the same.
|
|||
|
|
|||
|
Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore,
|
|||
|
Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard,
|
|||
|
Rich Salz, and Richard Stallman for help with various distribution
|
|||
|
headaches.
|
|||
|
|
|||
|
Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
|
|||
|
to Benson Margulies and Fred Burke for C++ support; to Kent Williams
|
|||
|
and Tom Epperly for C++ class support; to Ove Ewerlid for support of
|
|||
|
NUL's; and to Eric Hughes for support of multiple buffers.
|
|||
|
|
|||
|
This work was primarily done when I was with the Real Time Systems
|
|||
|
Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks
|
|||
|
to all there for the support I received.
|
|||
|
|
|||
|
Send comments to `vern@ee.lbl.gov'.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Tag Table:
|
|||
|
Node: Top1430
|
|||
|
Node: Name2808
|
|||
|
Node: Synopsis2933
|
|||
|
Node: Overview3145
|
|||
|
Node: Description4986
|
|||
|
Node: Examples5748
|
|||
|
Node: Format8896
|
|||
|
Node: Patterns11637
|
|||
|
Node: Matching18138
|
|||
|
Node: Actions21438
|
|||
|
Node: Generated scanner30560
|
|||
|
Node: Start conditions34988
|
|||
|
Node: Multiple buffers45069
|
|||
|
Node: End-of-file rules50975
|
|||
|
Node: Miscellaneous52508
|
|||
|
Node: User variables55279
|
|||
|
Node: YACC interface57651
|
|||
|
Node: Options58542
|
|||
|
Node: Performance78234
|
|||
|
Node: C++87532
|
|||
|
Node: Incompatibilities94993
|
|||
|
Node: Diagnostics101853
|
|||
|
Node: Files105094
|
|||
|
Node: Deficiencies105715
|
|||
|
Node: See also107684
|
|||
|
Node: Author108216
|
|||
|
|
|||
|
End Tag Table
|