\input texinfo.tex @c -*-texinfo-*-
|
|
@c $NetBSD: flex.texi,v 1.1.1.2 2013/04/06 14:05:50 christos Exp $
|
|
@c %**start of header
|
|
@setfilename flex.info
|
|
@include version.texi
|
|
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
|
|
@set authors Vern Paxson, Will Estes and John Millaway
|
|
@c "Macro Hooks" index
|
|
@defindex hk
|
|
@c "Options" index
|
|
@defindex op
|
|
@dircategory Programming
|
|
@direntry
|
|
* flex: (flex). Fast lexical analyzer generator (lex replacement).
|
|
@end direntry
|
|
@c %**end of header
|
|
|
|
@copying
|
|
|
|
The flex manual is placed under the same licensing conditions as the
|
|
rest of flex:
|
|
|
|
Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
|
|
The Flex Project.
|
|
|
|
Copyright @copyright{} 1990, 1997 The Regents of the University of California.
|
|
All rights reserved.
|
|
|
|
This code is derived from software contributed to Berkeley by
|
|
Vern Paxson.
|
|
|
|
The United States Government has rights in this work pursuant
|
|
to contract no. DE-AC03-76SF00098 between the United States
|
|
Department of Energy and the University of California.
|
|
|
|
Redistribution and use in source and binary forms, with or without
|
|
modification, are permitted provided that the following conditions
|
|
are met:
|
|
|
|
@enumerate
|
|
@item
|
|
Redistributions of source code must retain the above copyright
|
|
notice, this list of conditions and the following disclaimer.
|
|
|
|
@item
|
|
Redistributions in binary form must reproduce the above copyright
|
|
notice, this list of conditions and the following disclaimer in the
|
|
documentation and/or other materials provided with the distribution.
|
|
@end enumerate
|
|
|
|
Neither the name of the University nor the names of its contributors
|
|
may be used to endorse or promote products derived from this software
|
|
without specific prior written permission.
|
|
|
|
THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
|
|
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
|
|
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
|
|
PURPOSE.
|
|
@end copying
|
|
|
|
@titlepage
|
|
@title Lexical Analysis with Flex
|
|
@subtitle Edition @value{EDITION}, @value{UPDATED}
|
|
@author @value{authors}
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@insertcopying
|
|
@end titlepage
|
|
@contents
|
|
@ifnottex
|
|
@node Top, Copyright, (dir), (dir)
|
|
@top flex
|
|
|
|
This manual describes @code{flex}, a tool for generating programs that
|
|
perform pattern-matching on text. The manual includes both tutorial and
|
|
reference sections.
|
|
|
|
This edition of @cite{The flex Manual} documents @code{flex} version
|
|
@value{VERSION}. It was last updated on @value{UPDATED}.
|
|
|
|
This manual was written by @value{authors}.
|
|
|
|
@menu
|
|
* Copyright::
|
|
* Reporting Bugs::
|
|
* Introduction::
|
|
* Simple Examples::
|
|
* Format::
|
|
* Patterns::
|
|
* Matching::
|
|
* Actions::
|
|
* Generated Scanner::
|
|
* Start Conditions::
|
|
* Multiple Input Buffers::
|
|
* EOF::
|
|
* Misc Macros::
|
|
* User Values::
|
|
* Yacc::
|
|
* Scanner Options::
|
|
* Performance::
|
|
* Cxx::
|
|
* Reentrant::
|
|
* Lex and Posix::
|
|
* Memory Management::
|
|
* Serialized Tables::
|
|
* Diagnostics::
|
|
* Limitations::
|
|
* Bibliography::
|
|
* FAQ::
|
|
* Appendices::
|
|
* Indices::
|
|
|
|
@detailmenu
|
|
--- The Detailed Node Listing ---
|
|
|
|
Format of the Input File
|
|
|
|
* Definitions Section::
|
|
* Rules Section::
|
|
* User Code Section::
|
|
* Comments in the Input::
|
|
|
|
Scanner Options
|
|
|
|
* Options for Specifying Filenames::
|
|
* Options Affecting Scanner Behavior::
|
|
* Code-Level And API Options::
|
|
* Options for Scanner Speed and Size::
|
|
* Debugging Options::
|
|
* Miscellaneous Options::
|
|
|
|
Reentrant C Scanners
|
|
|
|
* Reentrant Uses::
|
|
* Reentrant Overview::
|
|
* Reentrant Example::
|
|
* Reentrant Detail::
|
|
* Reentrant Functions::
|
|
|
|
The Reentrant API in Detail
|
|
|
|
* Specify Reentrant::
|
|
* Extra Reentrant Argument::
|
|
* Global Replacement::
|
|
* Init and Destroy Functions::
|
|
* Accessor Methods::
|
|
* Extra Data::
|
|
* About yyscan_t::
|
|
|
|
Memory Management
|
|
|
|
* The Default Memory Management::
|
|
* Overriding The Default Memory Management::
|
|
* A Note About yytext And Memory::
|
|
|
|
Serialized Tables
|
|
|
|
* Creating Serialized Tables::
|
|
* Loading and Unloading Serialized Tables::
|
|
* Tables File Format::
|
|
|
|
FAQ
|
|
|
|
* When was flex born?::
|
|
* How do I expand backslash-escape sequences in C-style quoted strings?::
|
|
* Why do flex scanners call fileno if it is not ANSI compatible?::
|
|
* Does flex support recursive pattern definitions?::
|
|
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
|
|
* Flex is not matching my patterns in the same order that I defined them.::
|
|
* My actions are executing out of order or sometimes not at all.::
|
|
* How can I have multiple input sources feed into the same scanner at the same time?::
|
|
* Can I build nested parsers that work with the same input file?::
|
|
* How can I match text only at the end of a file?::
|
|
* How can I make REJECT cascade across start condition boundaries?::
|
|
* Why cant I use fast or full tables with interactive mode?::
|
|
* How much faster is -F or -f than -C?::
|
|
* If I have a simple grammar cant I just parse it with flex?::
|
|
* Why doesn't yyrestart() set the start state back to INITIAL?::
|
|
* How can I match C-style comments?::
|
|
* The period isn't working the way I expected.::
|
|
* Can I get the flex manual in another format?::
|
|
* Does there exist a "faster" NDFA->DFA algorithm?::
|
|
* How does flex compile the DFA so quickly?::
|
|
* How can I use more than 8192 rules?::
|
|
* How do I abandon a file in the middle of a scan and switch to a new file?::
|
|
* How do I execute code only during initialization (only before the first scan)?::
|
|
* How do I execute code at termination?::
|
|
* Where else can I find help?::
|
|
* Can I include comments in the "rules" section of the file?::
|
|
* I get an error about undefined yywrap().::
|
|
* How can I change the matching pattern at run time?::
|
|
* How can I expand macros in the input?::
|
|
* How can I build a two-pass scanner?::
|
|
* How do I match any string not matched in the preceding rules?::
|
|
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
|
|
* Is there a way to make flex treat NULL like a regular character?::
|
|
* Whenever flex can not match the input it says "flex scanner jammed".::
|
|
* Why doesn't flex have non-greedy operators like perl does?::
|
|
* Memory leak - 16386 bytes allocated by malloc.::
|
|
* How do I track the byte offset for lseek()?::
|
|
* How do I use my own I/O classes in a C++ scanner?::
|
|
* How do I skip as many chars as possible?::
|
|
* deleteme00::
|
|
* Are certain equivalent patterns faster than others?::
|
|
* Is backing up a big deal?::
|
|
* Can I fake multi-byte character support?::
|
|
* deleteme01::
|
|
* Can you discuss some flex internals?::
|
|
* unput() messes up yy_at_bol::
|
|
* The | operator is not doing what I want::
|
|
* Why can't flex understand this variable trailing context pattern?::
|
|
* The ^ operator isn't working::
|
|
* Trailing context is getting confused with trailing optional patterns::
|
|
* Is flex GNU or not?::
|
|
* ERASEME53::
|
|
* I need to scan if-then-else blocks and while loops::
|
|
* ERASEME55::
|
|
* ERASEME56::
|
|
* ERASEME57::
|
|
* Is there a repository for flex scanners?::
|
|
* How can I conditionally compile or preprocess my flex input file?::
|
|
* Where can I find grammars for lex and yacc?::
|
|
* I get an end-of-buffer message for each character scanned.::
|
|
* unnamed-faq-62::
|
|
* unnamed-faq-63::
|
|
* unnamed-faq-64::
|
|
* unnamed-faq-65::
|
|
* unnamed-faq-66::
|
|
* unnamed-faq-67::
|
|
* unnamed-faq-68::
|
|
* unnamed-faq-69::
|
|
* unnamed-faq-70::
|
|
* unnamed-faq-71::
|
|
* unnamed-faq-72::
|
|
* unnamed-faq-73::
|
|
* unnamed-faq-74::
|
|
* unnamed-faq-75::
|
|
* unnamed-faq-76::
|
|
* unnamed-faq-77::
|
|
* unnamed-faq-78::
|
|
* unnamed-faq-79::
|
|
* unnamed-faq-80::
|
|
* unnamed-faq-81::
|
|
* unnamed-faq-82::
|
|
* unnamed-faq-83::
|
|
* unnamed-faq-84::
|
|
* unnamed-faq-85::
|
|
* unnamed-faq-86::
|
|
* unnamed-faq-87::
|
|
* unnamed-faq-88::
|
|
* unnamed-faq-90::
|
|
* unnamed-faq-91::
|
|
* unnamed-faq-92::
|
|
* unnamed-faq-93::
|
|
* unnamed-faq-94::
|
|
* unnamed-faq-95::
|
|
* unnamed-faq-96::
|
|
* unnamed-faq-97::
|
|
* unnamed-faq-98::
|
|
* unnamed-faq-99::
|
|
* unnamed-faq-100::
|
|
* unnamed-faq-101::
|
|
* What is the difference between YYLEX_PARAM and YY_DECL?::
|
|
* Why do I get "conflicting types for yylex" error?::
|
|
* How do I access the values set in a Flex action from within a Bison action?::
|
|
|
|
Appendices
|
|
|
|
* Makefiles and Flex::
|
|
* Bison Bridge::
|
|
* M4 Dependency::
|
|
* Common Patterns::
|
|
|
|
Indices
|
|
|
|
* Concept Index::
|
|
* Index of Functions and Macros::
|
|
* Index of Variables::
|
|
* Index of Data Types::
|
|
* Index of Hooks::
|
|
* Index of Scanner Options::
|
|
|
|
@end detailmenu
|
|
@end menu
|
|
@end ifnottex
|
|
@node Copyright, Reporting Bugs, Top, Top
|
|
@chapter Copyright
|
|
|
|
@cindex copyright of flex
|
|
@cindex distributing flex
|
|
@insertcopying
|
|
|
|
@node Reporting Bugs, Introduction, Copyright, Top
|
|
@chapter Reporting Bugs
|
|
|
|
@cindex bugs, reporting
|
|
@cindex reporting bugs
|
|
|
|
If you find a bug in @code{flex}, please report it using
|
|
the SourceForge Bug Tracking facilities which can be found on
|
|
@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.
|
|
|
|
@node Introduction, Simple Examples, Reporting Bugs, Top
|
|
@chapter Introduction
|
|
|
|
@cindex scanner, definition of
|
|
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
|
|
program which recognizes lexical patterns in text. The @code{flex}
|
|
program reads the given input files, or its standard input if no file
|
|
names are given, for a description of a scanner to generate. The
|
|
description is in the form of pairs of regular expressions and C code,
|
|
called @dfn{rules}. @code{flex} generates as output a C source file,
|
|
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
|
|
This file can be compiled and linked with the flex runtime library to
|
|
produce an executable. When the executable is run, it analyzes its
|
|
input for occurrences of the regular expressions. Whenever it finds
|
|
one, it executes the corresponding C code.
|
|
|
|
@node Simple Examples, Format, Introduction, Top
|
|
@chapter Some Simple Examples
|
|
|
|
Here are some simple examples that give the flavor of how one uses
|
|
@code{flex}.
|
|
|
|
@cindex username expansion
|
|
The following @code{flex} input specifies a scanner which, when it
|
|
encounters the string @samp{username}, will replace it with the user's
|
|
login name:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
username printf( "%s", getlogin() );
|
|
@end verbatim
|
|
@end example
|
|
|
|
@cindex default rule
|
|
@cindex rules, default
|
|
By default, any text not matched by a @code{flex} scanner is copied to
|
|
the output, so the net effect of this scanner is to copy its input file
|
|
to its output with each occurrence of @samp{username} expanded. In this
|
|
input, there is just one rule. @samp{username} is the @dfn{pattern} and
|
|
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
|
|
beginning of the rules.
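
For comparison, the effect of this one-rule scanner can be sketched in plain
C. This is only an illustration, not the code that @code{flex} generates: the
function name @code{expand_username} and its buffer-based interface are
assumptions made for the example, and the login name is passed in as an
argument rather than obtained from @code{getlogin()}.

@example
@verbatim
#include <string.h>

/* Copy src to dst, replacing each occurrence of "username" with login.
   Text that does not match is copied unchanged, which mimics the
   default rule.  Returns 0 on success, -1 if dst is too small. */
int expand_username(const char *src, const char *login,
                    char *dst, size_t dst_size)
{
    static const char pattern[] = "username";
    const size_t plen = sizeof pattern - 1;
    const size_t llen = strlen(login);
    size_t used = 0;

    if (dst_size == 0)
        return -1;

    while (*src != '\0') {
        if (strncmp(src, pattern, plen) == 0) {
            /* The "rule" matched: emit the login name instead. */
            if (used + llen >= dst_size)
                return -1;
            memcpy(dst + used, login, llen);
            used += llen;
            src += plen;
        } else {
            /* No rule matched: copy one character through. */
            if (used + 1 >= dst_size)
                return -1;
            dst[used++] = *src++;
        }
    }
    dst[used] = '\0';
    return 0;
}
@end verbatim
@end example

Given the input @samp{hi username!} and the login @samp{alice}, this sketch
stores @samp{hi alice!} in the destination buffer.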
|
|
|
|
Here's another simple example:
|
|
|
|
@cindex counting characters and lines
|
|
@example
|
|
@verbatim
|
|
int num_lines = 0, num_chars = 0;
|
|
|
|
%%
|
|
\n ++num_lines; ++num_chars;
|
|
. ++num_chars;
|
|
|
|
%%
|
|
|
|
int main()
|
|
{
|
|
yylex();
|
|
printf( "# of lines = %d, # of chars = %d\n",
|
|
num_lines, num_chars );
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
This scanner counts the number of characters and the number of lines in
|
|
its input. It produces no output other than the final report on the
|
|
character and line counts. The first line declares two globals,
|
|
@code{num_lines} and @code{num_chars}, which are accessible both inside
|
|
@code{yylex()} and in the @code{main()} routine declared after the
|
|
second @samp{%%}. There are two rules, one which matches a newline
|
|
(@samp{\n}) and increments both the line count and the character count,
|
|
and one which matches any character other than a newline (indicated by
|
|
the @samp{.} regular expression).
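
The bookkeeping performed by these two rules can also be written out by hand.
The following is a sketch under stated assumptions, not generated scanner
code: the names @code{struct counts} and @code{count_text} are invented for
the illustration, and the input is a NUL-terminated string rather than
@code{yyin}, so an embedded NUL byte (which @samp{.} would match) ends the
count early.

@example
@verbatim
/* Totals kept by the two rules of the counting scanner. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Walk the text one character at a time, as the scanner's two
   single-character rules do. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };

    for (; *text != '\0'; ++text) {
        if (*text == '\n')
            ++c.num_lines;   /* the \n rule also counts a line */
        ++c.num_chars;       /* both rules count one character */
    }
    return c;
}
@end verbatim
@end example

For the text @samp{ab\ncd\n} (two lines of two letters each), the sketch
reports 2 lines and 6 characters.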
|
|
|
|
A somewhat more complicated example:
|
|
|
|
@cindex Pascal-like language
|
|
@example
|
|
@verbatim
|
|
/* scanner for a toy Pascal-like language */
|
|
|
|
%{
|
|
/* need this for the call to atof() below */
|
|
#include <stdlib.h>
|
|
%}
|
|
|
|
DIGIT [0-9]
|
|
ID [a-z][a-z0-9]*
|
|
|
|
%%
|
|
|
|
{DIGIT}+ {
|
|
printf( "An integer: %s (%d)\n", yytext,
|
|
atoi( yytext ) );
|
|
}
|
|
|
|
{DIGIT}+"."{DIGIT}* {
|
|
printf( "A float: %s (%g)\n", yytext,
|
|
atof( yytext ) );
|
|
}
|
|
|
|
if|then|begin|end|procedure|function {
|
|
printf( "A keyword: %s\n", yytext );
|
|
}
|
|
|
|
{ID} printf( "An identifier: %s\n", yytext );
|
|
|
|
"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
|
|
|
|
"{"[^}\n]*"}" /* eat up one-line comments */
|
|
|
|
[ \t\n]+ /* eat up whitespace */
|
|
|
|
. printf( "Unrecognized character: %s\n", yytext );
|
|
|
|
%%
|
|
|
|
int main( int argc, char **argv )
|
|
{
|
|
++argv, --argc; /* skip over program name */
|
|
if ( argc > 0 )
|
|
yyin = fopen( argv[0], "r" );
|
|
else
|
|
yyin = stdin;
|
|
|
|
yylex();
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
This is the beginning of a simple scanner for a language like Pascal.
|
|
It identifies different types of @dfn{tokens} and reports on what it has
|
|
seen.
|
|
|
|
The details of this example will be explained in the following
|
|
sections.
|
|
|
|
@node Format, Patterns, Simple Examples, Top
|
|
@chapter Format of the Input File
|
|
|
|
|
|
@cindex format of flex input
|
|
@cindex input, format of
|
|
@cindex file format
|
|
@cindex sections of flex input
|
|
|
|
The @code{flex} input file consists of three sections, separated by a
|
|
line containing only @samp{%%}.
|
|
|
|
@cindex format of input file
|
|
@example
|
|
@verbatim
|
|
definitions
|
|
%%
|
|
rules
|
|
%%
|
|
user code
|
|
@end verbatim
|
|
@end example
|
|
|
|
@menu
|
|
* Definitions Section::
|
|
* Rules Section::
|
|
* User Code Section::
|
|
* Comments in the Input::
|
|
@end menu
|
|
|
|
@node Definitions Section, Rules Section, Format, Format
|
|
@section Format of the Definitions Section
|
|
|
|
@cindex input file, Definitions section
|
|
@cindex Definitions, in flex input
|
|
The @dfn{definitions section} contains declarations of simple @dfn{name}
|
|
definitions to simplify the scanner specification, and declarations of
|
|
@dfn{start conditions}, which are explained in a later section.
|
|
|
|
@cindex aliases, how to define
|
|
@cindex pattern aliases, how to define
|
|
Name definitions have the form:
|
|
|
|
@example
|
|
@verbatim
|
|
name definition
|
|
@end verbatim
|
|
@end example
|
|
|
|
The @samp{name} is a word beginning with a letter or an underscore
|
|
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
|
|
@samp{-} (dash). The definition is taken to begin at the first
|
|
non-whitespace character following the name and continuing to the end of
|
|
the line. The definition can subsequently be referred to using
|
|
@samp{@{name@}}, which will expand to @samp{(definition)}. For example,
|
|
|
|
@cindex pattern aliases, defining
|
|
@cindex defining pattern aliases
|
|
@example
|
|
@verbatim
|
|
DIGIT [0-9]
|
|
ID [a-z][a-z0-9]*
|
|
@end verbatim
|
|
@end example
|
|
|
|
defines @samp{DIGIT} to be a regular expression which matches a single
|
|
digit, and @samp{ID} to be a regular expression which matches a letter
|
|
followed by zero-or-more letters-or-digits. A subsequent reference to
|
|
|
|
@cindex pattern aliases, use of
|
|
@example
|
|
@verbatim
|
|
{DIGIT}+"."{DIGIT}*
|
|
@end verbatim
|
|
@end example
|
|
|
|
is identical to
|
|
|
|
@example
|
|
@verbatim
|
|
([0-9])+"."([0-9])*
|
|
@end verbatim
|
|
@end example
|
|
|
|
and matches one-or-more digits followed by a @samp{.} followed by
|
|
zero-or-more digits.
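
The substitution of @samp{@{name@}} by @samp{(definition)} can be sketched as
a small C function. This is a simplified illustration, not how @code{flex}
itself is implemented: @code{struct name_def} and @code{expand_names} are
hypothetical names, and the sketch ignores quoting and character classes,
where a real scanner generator would not expand braces.

@example
@verbatim
#include <stdio.h>
#include <string.h>

/* One name definition from the definitions section. */
struct name_def {
    const char *name;
    const char *definition;
};

/* Expand every {NAME} in pattern to (definition).  Brace groups that
   do not name a known definition, such as {2,5}, are copied as-is.
   Returns 0 on success, -1 if out is too small. */
int expand_names(const char *pattern,
                 const struct name_def *defs, size_t ndefs,
                 char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;

    while (*pattern != '\0') {
        const char *close =
            (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const struct name_def *hit = NULL;

        if (close != NULL) {
            size_t len = (size_t)(close - pattern - 1);
            for (size_t i = 0; i < ndefs; ++i) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    hit = &defs[i];
                    break;
                }
            }
        }

        if (hit != NULL) {
            /* The parentheses keep the definition a single unit. */
            int n = snprintf(out + used, out_size - used,
                             "(%s)", hit->definition);
            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
@end verbatim
@end example

With @samp{DIGIT} defined as @samp{[0-9]}, expanding
@samp{@{DIGIT@}+"."@{DIGIT@}*} yields @samp{([0-9])+"."([0-9])*}. The added
parentheses are what make a following operator such as @samp{+} apply to the
whole definition rather than to its last element.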
|
|
|
|
@cindex comments in flex input
|
|
An unindented comment (i.e., a line
|
|
beginning with @samp{/*}) is copied verbatim to the output up
|
|
to the next @samp{*/}.
|
|
|
|
@cindex %@{ and %@}, in Definitions Section
|
|
@cindex embedding C code in flex input
|
|
@cindex C code in flex input
|
|
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
|
|
is also copied verbatim to the output (with the %@{ and %@} symbols
|
|
removed). The %@{ and %@} symbols must appear unindented on lines by
|
|
themselves.
|
|
|
|
@cindex %top
|
|
|
|
A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
|
|
that the code in a @code{%top} block is relocated to the @emph{top} of the
|
|
generated file, before any flex definitions @footnote{Actually,
|
|
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
|
|
The @code{%top} block is useful when you want certain preprocessor macros to be
|
|
defined or certain files to be included before the generated code.
|
|
The single characters, @samp{@{} and @samp{@}} are used to delimit the
|
|
@code{%top} block, as shown in the example below:
|
|
|
|
@example
|
|
@verbatim
|
|
%top{
|
|
/* This code goes at the "top" of the generated file. */
|
|
#include <stdint.h>
|
|
#include <inttypes.h>
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
Multiple @code{%top} blocks are allowed, and their order is preserved.
|
|
|
|
@node Rules Section, User Code Section, Definitions Section, Format
|
|
@section Format of the Rules Section
|
|
|
|
@cindex input file, Rules Section
|
|
@cindex rules, in flex input
|
|
The @dfn{rules} section of the @code{flex} input contains a series of
|
|
rules of the form:
|
|
|
|
@example
|
|
@verbatim
|
|
pattern action
|
|
@end verbatim
|
|
@end example
|
|
|
|
where the pattern must be unindented and the action must begin
|
|
on the same line.
|
|
@xref{Patterns}, for a further description of patterns and actions.
|
|
|
|
In the rules section, any indented or %@{ %@} enclosed text appearing
|
|
before the first rule may be used to declare variables which are local
|
|
to the scanning routine and (after the declarations) code which is to be
|
|
executed whenever the scanning routine is entered. Other indented or
|
|
%@{ %@} text in the rule section is still copied to the output, but its
|
|
meaning is not well-defined and it may well cause compile-time errors
|
|
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
|
|
Posix}, for other such features).
|
|
|
|
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
|
|
is copied verbatim to the output (with the %@{ and %@} symbols removed).
|
|
The %@{ and %@} symbols must appear unindented on lines by themselves.
|
|
|
|
@node User Code Section, Comments in the Input, Rules Section, Format
|
|
@section Format of the User Code Section
|
|
|
|
@cindex input file, user code Section
|
|
@cindex user code, in flex input
|
|
The user code section is simply copied to @file{lex.yy.c} verbatim. It
|
|
is used for companion routines which call or are called by the scanner.
|
|
The presence of this section is optional; if it is missing, the second
|
|
@samp{%%} in the input file may be skipped, too.
|
|
|
|
@node Comments in the Input, , User Code Section, Format
|
|
@section Comments in the Input
|
|
|
|
@cindex comments, syntax of
|
|
Flex supports C-style comments, that is, anything between @samp{/*} and
|
|
@samp{*/} is
|
|
considered a comment. Whenever flex encounters a comment, it copies the
|
|
entire comment verbatim to the generated source code. Comments may
|
|
appear just about anywhere, but with the following exceptions:
|
|
|
|
@itemize
|
|
@cindex comments, in rules section
|
|
@item
|
|
Comments may not appear in the Rules Section wherever flex is expecting
|
|
a regular expression. This means comments may not appear at the
|
|
beginning of a line, or immediately following a list of scanner states.
|
|
@item
|
|
Comments may not appear on an @samp{%option} line in the Definitions
|
|
Section.
|
|
@end itemize
|
|
|
|
If you want to follow a simple rule, then always begin a comment on a
|
|
new line, with one or more whitespace characters before the initial
|
|
@samp{/*}. This rule will work anywhere in the input file.
|
|
|
|
All the comments in the following example are valid:
|
|
|
|
@cindex comments, valid uses of
|
|
@cindex comments in the input
|
|
@example
|
|
@verbatim
|
|
%{
|
|
/* code block */
|
|
%}
|
|
|
|
/* Definitions Section */
|
|
%x STATE_X
|
|
|
|
%%
|
|
/* Rules Section */
|
|
ruleA /* after regex */ { /* code block */ } /* after code block */
|
|
/* Rules Section (indented) */
|
|
<STATE_X>{
|
|
ruleC ECHO;
|
|
ruleD ECHO;
|
|
%{
|
|
/* code block */
|
|
%}
|
|
}
|
|
%%
|
|
/* User Code Section */
|
|
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Patterns, Matching, Format, Top
|
|
@chapter Patterns
|
|
|
|
@cindex patterns, in rules section
|
|
@cindex regular expressions, in patterns
|
|
The patterns in the input (see @ref{Rules Section}) are written using an
|
|
extended set of regular expressions. These are:
|
|
|
|
@cindex patterns, syntax
|
|
@table @samp
|
|
@item x
|
|
match the character 'x'
|
|
|
|
@item .
|
|
any character (byte) except newline
|
|
|
|
@cindex [] in patterns
|
|
@cindex character classes in patterns, syntax of
|
|
@cindex POSIX, character classes in patterns, syntax of
|
|
@item [xyz]
|
|
a @dfn{character class}; in this case, the pattern
|
|
matches either an 'x', a 'y', or a 'z'
|
|
|
|
@cindex ranges in patterns
|
|
@item [abj-oZ]
|
|
a "character class" with a range in it; matches
|
|
an 'a', a 'b', any letter from 'j' through 'o',
|
|
or a 'Z'
|
|
|
|
@cindex ranges in patterns, negating
|
|
@cindex negating ranges in patterns
|
|
@item [^A-Z]
|
|
a "negated character class", i.e., any character
|
|
but those in the class. In this case, any
|
|
character EXCEPT an uppercase letter.
|
|
|
|
@item [^A-Z\n]
|
|
any character EXCEPT an uppercase letter or
|
|
a newline
|
|
|
|
@item [a-z]@{-@}[aeiou]
|
|
the lowercase consonants
|
|
|
|
@item r*
|
|
zero or more r's, where r is any regular expression
|
|
|
|
@item r+
|
|
one or more r's
|
|
|
|
@item r?
|
|
zero or one r's (that is, ``an optional r'')
|
|
|
|
@cindex braces in patterns
|
|
@item r@{2,5@}
|
|
anywhere from two to five r's
|
|
|
|
@item r@{2,@}
|
|
two or more r's
|
|
|
|
@item r@{4@}
|
|
exactly 4 r's
|
|
|
|
@cindex pattern aliases, expansion of
|
|
@item @{name@}
|
|
the expansion of the @samp{name} definition
|
|
(@pxref{Format}).
|
|
|
|
@cindex literal text in patterns, syntax of
|
|
@cindex verbatim text in patterns, syntax of
|
|
@item "[xyz]\"foo"
|
|
the literal string: @samp{[xyz]"foo}
|
|
|
|
@cindex escape sequences in patterns, syntax of
|
|
@item \X
|
|
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
|
|
@samp{v}, then the ANSI-C interpretation of @samp{\X}. Otherwise, a
|
|
literal @samp{X} (used to escape operators such as @samp{*})
|
|
|
|
@cindex NULL character in patterns, syntax of
|
|
@item \0
|
|
a NUL character (ASCII code 0)
|
|
|
|
@cindex octal characters in patterns
|
|
@item \123
|
|
the character with octal value 123
|
|
|
|
@item \x2a
|
|
the character with hexadecimal value 2a
|
|
|
|
@item (r)
|
|
match an @samp{r}; parentheses are used to override precedence (see below)
|
|
|
|
@item (?r-s:pattern)
|
|
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
|
|
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.
|
|
|
|
@samp{i} means case-insensitive. @samp{-i} means case-sensitive.
|
|
|
|
@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
|
|
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.
|
|
|
|
@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
|
|
it is backslash-escaped, contained within @samp{""}s, or appears inside a
|
|
character class.
|
|
|
|
The following are all valid:
|
|
|
|
@verbatim
|
|
(?:foo) same as (foo)
|
|
(?i:ab7) same as ([aA][bB]7)
|
|
(?-i:ab) same as (ab)
|
|
(?s:.) same as [\x00-\xFF]
|
|
(?-s:.) same as [^\n]
|
|
(?ix-s: a . b) same as ([Aa][^\n][bB])
|
|
(?x:a b) same as ("ab")
|
|
(?x:a\ b) same as ("a b")
|
|
(?x:a" "b) same as ("a b")
|
|
(?x:a[ ]b) same as ("a b")
|
|
(?x:a
|
|
/* comment */
|
|
b
|
|
c) same as (abc)
|
|
@end verbatim
|
|
|
|
@item (?# comment )
|
|
omit everything within @samp{()}. The first @samp{)}
|
|
character encountered ends the pattern. It is not possible for the comment
|
|
to contain a @samp{)} character. The comment may span lines.
|
|
|
|
@cindex concatenation, in patterns
|
|
@item rs
|
|
the regular expression @samp{r} followed by the regular expression @samp{s}; called
|
|
@dfn{concatenation}
|
|
|
|
@item r|s
|
|
either an @samp{r} or an @samp{s}
|
|
|
|
@cindex trailing context, in patterns
|
|
@item r/s
|
|
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
|
|
included when determining whether this rule is the longest match, but is
|
|
then returned to the input before the action is executed. So the action
|
|
only sees the text matched by @samp{r}. This type of pattern is called
|
|
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
|
|
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
|
|
context.)
|
|
|
|
@cindex beginning of line, in patterns
|
|
@cindex BOL, in patterns
|
|
@item ^r
|
|
an @samp{r}, but only at the beginning of a line (i.e.,
|
|
when just starting to scan, or right after a
|
|
newline has been scanned).
|
|
|
|
@cindex end of line, in patterns
|
|
@cindex EOL, in patterns
|
|
@item r$
|
|
an @samp{r}, but only at the end of a line (i.e., just before a
|
|
newline). Equivalent to @samp{r/\n}.
|
|
|
|
@cindex newline, matching in patterns
|
|
Note that @code{flex}'s notion of ``newline'' is exactly
|
|
whatever the C compiler used to compile @code{flex}
|
|
interprets @samp{\n} as; in particular, on some DOS
|
|
systems you must either filter out @samp{\r}s in the
|
|
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.
|
|
|
|
@cindex start conditions, in patterns
|
|
@item <s>r
|
|
an @samp{r}, but only in start condition @code{s} (see @ref{Start
|
|
Conditions} for discussion of start conditions).
|
|
|
|
@item <s1,s2,s3>r
|
|
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.
|
|
|
|
@item <*>r
|
|
an @samp{r} in any start condition, even an exclusive one.
|
|
|
|
@cindex end of file, in patterns
|
|
@cindex EOF in patterns, syntax of
|
|
@item <<EOF>>
|
|
an end-of-file.
|
|
|
|
@item <s1,s2><<EOF>>
|
|
an end-of-file when in start condition @code{s1} or @code{s2}
|
|
@end table
|
|
|
|
Note that inside of a character class, all regular expression operators
|
|
lose their special meaning except escape (@samp{\}) and the character class
|
|
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.
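
The escape forms listed in the table above (@samp{\X}, @samp{\0},
@samp{\123}, and @samp{\x2a}) can be illustrated with a small decoder. This is
a sketch, not code taken from @code{flex}: the name @code{decode_escape} is
hypothetical, and the limits of at most three octal digits and at most two
hexadecimal digits are assumptions of the example.

@example
@verbatim
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; c must satisfy isxdigit(). */
static unsigned hex_value(int c)
{
    return isdigit(c) ? (unsigned)(c - '0')
                      : (unsigned)(tolower(c) - 'a' + 10);
}

/* Decode the escape sequence at s, which must begin with a backslash.
   Stores the resulting byte in *out and returns the number of
   characters consumed, or 0 if s is not an escape sequence. */
size_t decode_escape(const char *s, unsigned char *out)
{
    size_t i;
    unsigned value = 0;

    if (s[0] != '\\' || s[1] == '\0')
        return 0;

    switch (s[1]) {              /* the ANSI-C single-letter escapes */
    case 'a': *out = '\a'; return 2;
    case 'b': *out = '\b'; return 2;
    case 'f': *out = '\f'; return 2;
    case 'n': *out = '\n'; return 2;
    case 'r': *out = '\r'; return 2;
    case 't': *out = '\t'; return 2;
    case 'v': *out = '\v'; return 2;
    default: break;
    }

    if (s[1] >= '0' && s[1] <= '7') {          /* octal: \0, \123 */
        for (i = 1; i <= 3 && s[i] >= '0' && s[i] <= '7'; ++i)
            value = value * 8 + (unsigned)(s[i] - '0');
        *out = (unsigned char)value;
        return i;
    }

    if (s[1] == 'x' && isxdigit((unsigned char)s[2])) {  /* hex: \x2a */
        for (i = 2; i <= 3 && isxdigit((unsigned char)s[i]); ++i)
            value = value * 16 + hex_value((unsigned char)s[i]);
        *out = (unsigned char)value;
        return i;
    }

    *out = (unsigned char)s[1];  /* anything else: a literal, e.g. \* */
    return 2;
}
@end verbatim
@end example

In ASCII, @samp{\123} decodes to @samp{S} (octal 123 is decimal 83) and
@samp{\x2a} decodes to @samp{*} (hexadecimal 2a is decimal 42).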
|
|
|
|
@cindex patterns, precedence of operators
|
|
The regular expressions listed above are grouped according to
|
|
precedence, from highest precedence at the top to lowest at the bottom.
|
|
Those grouped together have equal precedence (see special note on the
|
|
precedence of the repeat operator, @samp{@{@}}, under the documentation
|
|
for the @samp{--posix} POSIX compliance option). For example,
|
|
|
|
@cindex patterns, grouping and precedence
|
|
@example
|
|
@verbatim
|
|
foo|bar*
|
|
@end verbatim
|
|
@end example
|
|
|
|
is the same as
|
|
|
|
@example
|
|
@verbatim
|
|
(foo)|(ba(r*))
|
|
@end verbatim
|
|
@end example
|
|
|
|
since the @samp{*} operator has higher precedence than concatenation,
|
|
and concatenation higher than alternation (@samp{|}). This pattern
|
|
therefore matches @emph{either} the string @samp{foo} @emph{or} the
|
|
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
|
|
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:
|
|
|
|
@example
|
|
@verbatim
|
|
foo|(bar)*
|
|
@end verbatim
|
|
@end example
|
|
|
|
And to match a sequence of zero or more repetitions of @samp{foo} and
|
|
@samp{bar}:
|
|
|
|
@cindex patterns, repetitions with grouping
|
|
@example
|
|
@verbatim
|
|
(foo|bar)*
|
|
@end verbatim
|
|
@end example
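
The three groupings above can be checked experimentally. The sketch below
uses the POSIX @code{regcomp()} and @code{regexec()} functions rather than
@code{flex}; this is an assumption of convenience, relying on the fact that
repetition, concatenation, and alternation have the same relative precedence
in POSIX extended regular expressions. The helper name @code{matches_whole}
is invented for the illustration.

@example
@verbatim
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the extended regular
   expression pattern, 0 if it does not, and -1 if the pattern is
   too long or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Anchor the pattern so that partial matches do not count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
@end verbatim
@end example

With this helper, @samp{foo|bar*} accepts @samp{barrr} but rejects
@samp{barbar}, @samp{foo|(bar)*} accepts @samp{barbar} but rejects
@samp{foobar}, and @samp{(foo|bar)*} accepts @samp{foobarfoo}.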
|
|
|
|
@cindex character classes in patterns
|
|
In addition to characters and ranges of characters, character classes
|
|
can also contain @dfn{character class expressions}. These are
|
|
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
|
|
themselves must appear between the @samp{[} and @samp{]} of the
|
|
character class. Other elements may occur inside the character class,
|
|
too). The valid expressions are:
|
|
|
|
@cindex patterns, valid character classes
|
|
@example
|
|
@verbatim
|
|
[:alnum:] [:alpha:] [:blank:]
|
|
[:cntrl:] [:digit:] [:graph:]
|
|
[:lower:] [:print:] [:punct:]
|
|
[:space:] [:upper:] [:xdigit:]
|
|
@end verbatim
|
|
@end example
|
|
|
|
These expressions all designate a set of characters equivalent to the
|
|
corresponding standard C @code{isXXX} function. For example,
|
|
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
|
|
returns true - i.e., any alphabetic or numeric character. Some systems
|
|
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
|
|
blank or a tab.
|
|
|
|
For example, the following character classes are all equivalent:
|
|
|
|
@cindex character classes, equivalence of
|
|
@cindex patterns, character class equivalence
|
|
@example
|
|
@verbatim
|
|
[[:alnum:]]
|
|
[[:alpha:][:digit:]]
|
|
[[:alpha:][0-9]]
|
|
[a-zA-Z0-9]
|
|
@end verbatim
|
|
@end example
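
The equivalence between @samp{[[:alnum:]]} and @samp{[a-zA-Z0-9]} can be
checked by enumerating every byte value. The following is a sketch, not part
of @code{flex}: the function names are hypothetical, and the result assumes
an ASCII character set and the default @samp{C} locale, which is in effect
unless the program calls @code{setlocale()}.

@example
@verbatim
#include <ctype.h>
#include <limits.h>
#include <string.h>

/* Set set[c] to 1 for every byte value c accepted by pred. */
void build_class(int (*pred)(int), unsigned char set[UCHAR_MAX + 1])
{
    int c;

    for (c = 0; c <= UCHAR_MAX; ++c)
        set[c] = pred(c) ? 1 : 0;
}

/* Membership test for the literal class [a-zA-Z0-9] (ASCII assumed). */
static int in_literal_alnum(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Return 1 if isalnum() and [a-zA-Z0-9] select the same bytes. */
int alnum_matches_literal(void)
{
    unsigned char by_ctype[UCHAR_MAX + 1];
    unsigned char by_range[UCHAR_MAX + 1];

    build_class(isalnum, by_ctype);
    build_class(in_literal_alnum, by_range);
    return memcmp(by_ctype, by_range, sizeof by_ctype) == 0;
}
@end verbatim
@end example

In other locales @code{isalnum()} may accept additional characters, in which
case the two sets would no longer be identical.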
|
|
|
|
A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
|
|
This means the character classes are sensitive to the locale in which @code{flex}
|
|
is executed, and the resulting scanner will not be sensitive to the runtime locale.
|
|
This may or may not be desirable.
|
|
|
|
|
|
@itemize
|
|
@cindex case-insensitive, effect on character classes
|
|
@item If your scanner is case-insensitive (the @samp{-i} flag), then
|
|
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
|
|
@samp{[:alpha:]}.
|
|
|
|
@anchor{case and character ranges}
|
|
@item Character classes with ranges, such as @samp{[A-t]}, should be used with
|
|
caution in a case-insensitive scanner if the range spans upper or lowercase
|
|
characters. Flex does not know if you want to fold all upper and lowercase
|
|
characters together, or if you want the literal numeric range specified (with
|
|
no case folding). When in doubt, flex will assume that you meant the literal
|
|
numeric range, and will issue a warning. The exception to this rule is a
|
|
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
|
|
want case-folding to occur. Here are some examples with the @samp{-i} flag
|
|
enabled:
|
|
|
|
@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
|
|
@item Range @tab Result @tab Literal Range @tab Alternate Range
|
|
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
|
|
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
|
|
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
|
|
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
|
|
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
|
|
@end multitable
|
|
|
@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched. Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.

Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example

Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.

@item
The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which, in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.

@item
The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.
|
@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can occur only at the beginning of a pattern and, like @samp{/} and @samp{$},
cannot be grouped inside parentheses. A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.

@item
The following are invalid:

@cindex patterns, invalid trailing context
@example
@verbatim
foo/bar$
<sc1>foo<sc2>bar
@end verbatim
@end example

Note that the first of these can be written @samp{foo/bar\n}.

@item
The following will result in @samp{$} or @samp{^} being treated as a normal character:

@cindex patterns, special characters treated as non-special
@example
@verbatim
foo|(bar$)
foo|^bar
@end verbatim
@end example

If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):

@cindex patterns, end of line
@example
@verbatim
foo      |
bar$     /* action goes here */
@end verbatim
@end example

A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize
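
The @samp{@{-@}} and @samp{@{+@}} operators described in the list above
behave like ordinary set difference and union, applied from left to
right. A minimal sketch in C, assuming the ``C'' locale and ASCII; the
256-entry tables and the helper names are hypothetical and say nothing
about how flex represents character classes internally:

```c
#include <ctype.h>
#include <string.h>

#define NCHARS 256

/* SET := the inclusive range LO-HI, e.g. [a-c]. */
void set_range(unsigned char set[NCHARS], unsigned char lo,
               unsigned char hi)
{
    memset(set, 0, NCHARS);
    for (int c = lo; c <= hi; ++c)
        set[c] = 1;
}

/* LHS := LHS {-} RHS: remove every member of RHS from LHS. */
void set_diff(unsigned char lhs[NCHARS], const unsigned char rhs[NCHARS])
{
    for (int c = 0; c < NCHARS; ++c)
        if (rhs[c])
            lhs[c] = 0;
}

/* LHS := LHS {+} RHS: add every member of RHS to LHS. */
void set_union(unsigned char lhs[NCHARS], const unsigned char rhs[NCHARS])
{
    for (int c = 0; c < NCHARS; ++c)
        if (rhs[c])
            lhs[c] = 1;
}

/* Evaluate [a-c]{-}[b-z]; return 1 if the result is exactly {'a'}. */
int diff_leaves_only_a(void)
{
    unsigned char lhs[NCHARS], rhs[NCHARS], expect[NCHARS];

    set_range(lhs, 'a', 'c');
    set_range(rhs, 'b', 'z');
    set_diff(lhs, rhs);
    set_range(expect, 'a', 'a');
    return memcmp(lhs, expect, NCHARS) == 0;
}

/* Evaluate [[:alpha:]]{-}[[:lower:]]{+}[q] from left to right; return 1
 * if the result equals [A-Zq]. */
int alpha_minus_lower_plus_q(void)
{
    unsigned char acc[NCHARS], lower[NCHARS], q[NCHARS], expect[NCHARS];

    for (int c = 0; c < NCHARS; ++c) {
        acc[c] = isalpha(c) ? 1 : 0;
        lower[c] = islower(c) ? 1 : 0;
    }
    set_diff(acc, lower);
    set_range(q, 'q', 'q');
    set_union(acc, q);
    set_range(expect, 'A', 'Z');
    expect['q'] = 1;
    return memcmp(acc, expect, NCHARS) == 0;
}
```

Both checking functions return 1 under the stated assumptions.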
|
@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched

@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
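
The two selection rules (longest match first, then earliest rule) can be
sketched in C as follows. This is a model, not flex's matching engine:
each rule is a hypothetical function that reports how many leading
characters it matches, and @code{match_if} and @code{match_word} stand
in for the patterns @samp{if} and @samp{[a-z]+}.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule is modelled as a function that returns how many leading
 * characters of INPUT it matches (0 means no match). */
typedef size_t (*rule_fn)(const char *input);

/* Stand-in for the pattern:  if  */
size_t match_if(const char *input)
{
    return strncmp(input, "if", 2) == 0 ? 2 : 0;
}

/* Stand-in for the pattern:  [a-z]+  */
size_t match_word(const char *input)
{
    size_t n = 0;

    while (islower((unsigned char)input[n]))
        ++n;
    return n;
}

/* Return the index of the winning rule, or -1 if no rule matches.
 * *LEN receives the length of the winning match. */
int pick_rule(const char *input, const rule_fn rules[], size_t nrules,
              size_t *len)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; ++i) {
        size_t n = rules[i](input);

        /* Strictly longer wins; on a tie the earlier rule is kept. */
        if (n > best_len) {
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

With the rules listed in the order @code{match_if}, @code{match_word},
the input @samp{if (x)} selects rule 0 (both match two characters, and
the earlier rule wins), while @samp{iffy} selects rule 1 because it
matches more text.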
|
@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.

@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:

@cindex minimal scanner
@example
@verbatim
%%
@end verbatim
@end example

which generates a scanner that simply copies its input (one character at
a time) to its output.
|
@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.

@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:

@example
@verbatim
extern char yytext[];
@end verbatim
@end example

This declaration is erroneous when used with @code{%pointer}, but correct
for @code{%array}.

The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by simply #define'ing @code{YYLMAX} to a different
value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer} @code{yytext} grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.

@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes
(@pxref{Cxx}).
|
@node Actions, Generated Scanner, Matching, Top
@chapter Actions

@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:

@cindex deleting lines from input
@example
@verbatim
%%
"zap me"
@end verbatim
@end example

This example will copy all other characters in the input to the output
since they will be matched by the default rule.

Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:

@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
%%
[ \t]+        putchar( ' ' );
[ \t]+$       /* ignore this token */
@end verbatim
@end example
|
@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action extends until the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).

@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.

Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
return.

@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(characters added to its end would overwrite later characters in
the input stream). This restriction does not apply when using
@code{%array} (@pxref{Matching}). In that case, @code{yytext} may be
freely modified in any way.

@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).

@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:
|
@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.

@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).

@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
are set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:

@example
@verbatim
        int word_count = 0;
%%

frob        special(); REJECT;
[^ \t\n]+   ++word_count;
@end verbatim
@end example
|
Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token. Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:

@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
.|\n     /* eat up any unmatched character */
@end verbatim
@end example

The first three rules share the fourth's action since they use the
special @samp{|} action.

@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).

Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.
|
@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:

@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
%%
mega-    ECHO; yymore();
kludge   ECHO;
@end verbatim
@end example

First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of @code{yytext}, so the @code{ECHO} for the @samp{kludge} rule
will actually write @samp{mega-kludge}.
@end table
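
The order in which @code{REJECT} visits candidates in the @samp{abcd}
example above can be modelled in plain C. The sketch below hard-codes
the four literal rules and a catch-all; the function name
@code{scan_with_reject} is made up, and real scanners use generated
tables rather than string comparisons.

```c
#include <stddef.h>
#include <string.h>

/* Rules 0..3 are the literals a, ab, abc, abcd; each one ECHOes and then
 * REJECTs.  A final catch-all rule plays the part of  .|\n  and silently
 * eats one character.  OUT must be large enough for the echoed text. */
void scan_with_reject(const char *input, char *out)
{
    static const char *const literals[] = { "a", "ab", "abc", "abcd" };
    const size_t nlit = sizeof literals / sizeof literals[0];

    out[0] = '\0';
    while (*input != '\0') {
        /* Visit candidates from the longest match down.  The literals
         * all differ in length, so no equal-length ties arise here. */
        for (size_t len = strlen(literals[nlit - 1]); len >= 1; --len) {
            for (size_t i = 0; i < nlit; ++i) {
                if (strlen(literals[i]) == len &&
                    strncmp(input, literals[i], len) == 0) {
                    strncat(out, input, len);   /* ECHO */
                    /* REJECT: carry on with the next-best candidate */
                }
            }
        }
        /* Every literal rule has rejected (or none matched): the
         * catch-all matches one character and discards it. */
        ++input;
    }
}
```

Called with the input @samp{abcd}, the model stores @samp{abcdabcaba}
in @code{out}, the same text the scanner above writes.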
|
@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.
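
The append behaviour of @code{yymore()}, and why it depends on
@code{yyleng}, can be pictured with a small buffer model. Everything
here (@code{struct token_buf}, @code{model_yymore}, @code{model_match},
@code{replay_mega_kludge}) is hypothetical and only mimics the
mega-kludge example; it is not how the generated scanner is written.

```c
#include <string.h>

/* A tiny stand-in for the scanner's token buffer. */
struct token_buf {
    char text[64];      /* plays the role of yytext */
    size_t leng;        /* plays the role of yyleng */
    int more;           /* set by the yymore() stand-in */
};

/* Stand-in for yymore(): ask for the next token to be appended. */
void model_yymore(struct token_buf *tb)
{
    tb->more = 1;
}

/* Install the next matched token.  If an append was requested, the new
 * text goes after the first LENG characters already held, which is why
 * LENG must still describe the current token; otherwise it replaces
 * them. */
void model_match(struct token_buf *tb, const char *matched)
{
    size_t keep = tb->more ? tb->leng : 0;
    size_t n = strlen(matched);

    if (n > sizeof tb->text - 1 - keep)
        n = sizeof tb->text - 1 - keep;     /* truncate in this sketch */
    memcpy(tb->text + keep, matched, n);
    tb->leng = keep + n;
    tb->text[tb->leng] = '\0';
    tb->more = 0;
}

/* Replay the mega-kludge example; OUT receives what the two ECHOs
 * would write. */
void replay_mega_kludge(char *out)
{
    struct token_buf tb = { "", 0, 0 };

    out[0] = '\0';
    model_match(&tb, "mega-");
    strcat(out, tb.text);       /* ECHO */
    model_yymore(&tb);          /* yymore() */
    model_match(&tb, "kludge");
    strcat(out, tb.text);       /* ECHO */
}
```

After @code{replay_mega_kludge(out)}, @code{out} holds
@samp{mega-mega-kludge}.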
|
@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:

@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
@end verbatim
@end example

An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.

Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.

@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.

@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
{
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
}
@end verbatim
@end example

Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.
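
Why strings must be pushed back to front follows from treating the
pushed-back input as a stack. A sketch under that assumption, with
hypothetical @code{model_unput()} and @code{model_input()} helpers
standing in for the real macros:

```c
#include <string.h>

/* A pushback stack standing in for the scanner's input: the character
 * pushed last is the one read next. */
struct pushback {
    char data[128];
    size_t top;
};

/* Stand-in for unput(c). */
void model_unput(struct pushback *pb, char c)
{
    if (pb->top < sizeof pb->data)
        pb->data[pb->top++] = c;
}

/* Return the next character, or -1 when nothing is left. */
int model_input(struct pushback *pb)
{
    return pb->top > 0 ? (unsigned char)pb->data[--pb->top] : -1;
}

/* Push TOKEN back enclosed in parentheses, back to front, then read the
 * input into OUT to show the order in which it would be rescanned. */
void reparenthesize(const char *token, char *out)
{
    struct pushback pb = { {0}, 0 };
    size_t i = strlen(token);
    size_t n = 0;
    int c;

    model_unput(&pb, ')');
    while (i > 0)
        model_unput(&pb, token[--i]);
    model_unput(&pb, '(');

    while ((c = model_input(&pb)) != -1)
        out[n++] = (char)c;
    out[n] = '\0';
}
```

For the token @samp{foo} the characters come back as @samp{(foo)};
pushing them in reading order instead would reverse the token.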
|
@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).

@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.

@cindex input()
@code{input()} reads the next character from the input stream. For
example, the following is one way to eat up C comments:

@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
%%
"/*"        {
            register int c;

            for ( ; ; )
                {
                while ( (c = input()) != '*' &&
                        c != EOF )
                    ;    /* eat up text of comment */

                if ( c == '*' )
                    {
                    while ( (c = input()) == '*' )
                        ;
                    if ( c == '/' )
                        break;    /* found the end */
                    }

                if ( c == EOF )
                    {
                    error( "EOF in comment" );
                    break;
                    }
                }
            }
@end verbatim
@end example
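
The same loop can be exercised outside a scanner by reading from a
string instead of calling @code{input()}. In this sketch the function
@code{skip_comment()} is hypothetical, a NUL terminator plays the role
of @samp{EOF}, and the caller is assumed to have consumed the opening
delimiter already:

```c
#include <stddef.h>

/* TEXT points just past the opening delimiter of a C comment.  Return
 * the offset just past the closing delimiter, or -1 if the text ends
 * first (the "EOF in comment" case). */
long skip_comment(const char *text)
{
    size_t i = 0;

    for (;;) {
        while (text[i] != '\0' && text[i] != '*')
            ++i;                    /* eat up text of comment */
        if (text[i] == '\0')
            return -1;              /* end of input inside the comment */
        while (text[i] == '*')
            ++i;                    /* eat up a run of stars */
        if (text[i] == '/')
            return (long)(i + 1);   /* found the end */
        /* Otherwise keep scanning; a NUL here is caught next time. */
    }
}
```

A star that is not followed by a slash sends the loop back to skipping
ordinary text, which mirrors the structure of the action above.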
|
@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @b{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)

@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).

@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
|
@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner

@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:

@example
@verbatim
int yylex()
    {
    ... various definitions and the actions in here ...
    }
@end verbatim
@end example

@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:

@cindex yylex, overriding the prototype of
@example
@verbatim
#define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example

to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semicolon
(@samp{;}).

@code{flex} generates @samp{C99} function definitions by
default. However flex does have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This is to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. For this
reason, future versions of @code{flex} may generate standard C99 code
only, leaving K&R-style functions to the historians. Currently, if you
do @strong{not} want @samp{C99} definitions, then you must use
@code{%option noansi-definitions}.
|
@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.

@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either @file{yyin} is pointed at a new input file (in which case
scanning continues from that file), or @code{yyrestart()} is called.
@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).

@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.
|
@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}. The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.

@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):

@example
@verbatim
%{
#define YY_INPUT(buf,result,max_size) \
    { \
    int c = getchar(); \
    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}
@end verbatim
@end example

This definition will change the input processing to occur one character
at a time.
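
The @code{YY_INPUT} contract can also be illustrated as an ordinary
function reading from memory. This is a sketch only:
@code{struct string_source} and @code{read_block()} are invented names,
and the return value 0 stands for @code{YY_NULL} as given above for Unix
systems.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source; by default flex reads from yyin. */
struct string_source {
    const char *data;
    size_t len;
    size_t pos;
};

/* Same contract as YY_INPUT(buf,result,max_size), written as a
 * function: copy up to MAX_SIZE characters into BUF and return how many
 * were stored, or 0 at end of input. */
size_t read_block(struct string_source *src, char *buf, size_t max_size)
{
    size_t left = src->len - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}
```

Each call hands over at most @code{max_size} characters, so a source of
eleven characters read in blocks of eight yields 8, then 3, then 0.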
|
@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then calls the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.

@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
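
The return-value convention of @code{yywrap()} can be sketched as
follows. The input list and the name @code{model_yywrap()} are
hypothetical; a real @code{yywrap()} would open the next file and assign
it to @file{yyin} where this model merely records a name.

```c
#include <stddef.h>

/* Hypothetical list of inputs to be scanned one after another. */
struct input_list {
    const char *const *names;   /* input names, in order */
    size_t count;               /* number of names */
    size_t next;                /* index of the next name to use */
    const char *current;        /* stands in for yyin */
};

/* Follows the yywrap() convention: return 0 after switching to a new
 * input (scanning continues), 1 when nothing is left (scanner stops). */
int model_yywrap(struct input_list *in)
{
    if (in->next >= in->count)
        return 1;
    in->current = in->names[in->next++];
    return 0;
}
```

With two inputs and the first already being scanned, the first call
switches to the second input and returns 0; the next call returns 1.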
|
For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.

@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.
|
@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions

@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,

@c proofread edit stopped here
@example
@verbatim
<STRING>[^"]*        { /* eat up the string body ... */
            ...
            }
@end verbatim
@end example

will be active only when the scanner is in the @code{STRING} start
condition, and

@cindex start conditions, multiple
@example
@verbatim
<INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
            ...
            }
@end verbatim
@end example

will be active only when the current start condition is either
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.

@cindex start conditions, inclusive vs.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive. If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).
|
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:

@cindex start conditions, inclusive
@example
@verbatim
%s example
%%

<example>foo   do_something();

bar            something_else();
@end verbatim
@end example

is equivalent to

@cindex start conditions, exclusive
@example
@verbatim
%x example
%%

<example>foo   do_something();

<INITIAL,example>bar    something_else();
@end verbatim
@end example

Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive (@code{%s}) start condition.
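
The activation rules for inclusive and exclusive start conditions can be
summarised in a few lines of C. The structures and the function
@code{rule_active()} below are hypothetical and only restate the rules
given above; flex compiles start conditions into its tables instead.

```c
#include <stddef.h>

/* Two start conditions for this sketch: INITIAL and "example". */
enum { SC_INITIAL = 0, SC_EXAMPLE, SC_COUNT };

/* Records whether each start condition was declared exclusive (%x). */
struct sc_table {
    int exclusive[SC_COUNT];
};

/* A rule lists the start conditions it is qualified with; NQUALS == 0
 * means the rule has no <...> prefix at all. */
struct rule {
    int quals[SC_COUNT];
    size_t nquals;
};

/* Return 1 if rule R may match while the scanner is in condition SC. */
int rule_active(const struct sc_table *t, const struct rule *r, int sc)
{
    if (r->nquals == 0)
        /* Unqualified rules are active only in inclusive conditions
         * (INITIAL counts as inclusive in this model). */
        return !t->exclusive[sc];
    for (size_t i = 0; i < r->nquals; ++i)
        if (r->quals[i] == sc)
            return 1;
    return 0;
}
```

With @code{example} marked exclusive, an unqualified @code{bar} rule is
inactive in @code{example}, while a rule qualified with both
@code{INITIAL} and @code{example} is active in either condition.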
|
@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier @code{<*>}
matches every start condition. Thus, the above example could also
have been written:

@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
%x example
%%

<example>foo   do_something();

<*>bar         something_else();
@end verbatim
@end example

The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:

@cindex start conditions, behavior of default rule
@example
@verbatim
<*>.|\n     ECHO;
@end verbatim
@end example

@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)

@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:

@cindex start conditions, using BEGIN
@example
@verbatim
        int enter_special;

%x SPECIAL
%%
        if ( enter_special )
            BEGIN(SPECIAL);

<SPECIAL>blahblahblah
...more rules follow...
@end verbatim
@end example
|
To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}. But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:

@cindex start conditions, for different interpretations of same input
@example
@verbatim
%{
#include <stdlib.h>   /* atof(), atoi() */
%}
%s expect

%%
expect-floats        BEGIN(expect);

<expect>[0-9]+"."[0-9]+      {
            printf( "found a float, = %f\n",
                    atof( yytext ) );
            }
<expect>\n           {
            /* that's the end of the line, so
             * we need another "expect-floats"
             * before we'll recognize any more
             * numbers
             */
            BEGIN(INITIAL);
            }

[0-9]+      {
            printf( "found an integer, = %d\n",
                    atoi( yytext ) );
            }

"."         printf( "found a dot\n" );
@end verbatim
@end example
|
|
@cindex comments, example of scanning C comments
|
|
Here is a scanner which recognizes (and discards) C comments while
|
|
maintaining a count of the current input line.
|
|
|
|
@cindex recognizing C comments
|
|
@example
|
|
@verbatim
|
|
%x comment
|
|
%%
|
|
        int line_num = 1;
|
|
|
|
"/*" BEGIN(comment);
|
|
|
|
<comment>[^*\n]* /* eat anything that's not a '*' */
|
|
<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
|
|
<comment>\n ++line_num;
|
|
<comment>"*"+"/" BEGIN(INITIAL);
|
|
@end verbatim
|
|
@end example
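
For readers who find it easier to follow the control flow in plain C, the two-state behavior of the scanner above can be sketched by hand. This is only an illustrative model, not code that @code{flex} generates: the names @code{strip_comments}, @code{SC_INITIAL} and @code{SC_COMMENT} are hypothetical, and the default rule is assumed to echo unmatched text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the INITIAL and comment start conditions. */
enum sc { SC_INITIAL, SC_COMMENT };

/*
 * Copy src to dst with C comments removed, mirroring the rules above.
 * dst must have room for strlen(src) + 1 bytes.  Returns the final
 * line count which, as in the scanner above, starts at 1 and only
 * advances on newlines seen inside comments.
 */
int strip_comments(const char *src, char *dst)
{
    enum sc state = SC_INITIAL;
    int line_num = 1;
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);

    while (src[i] != '\0') {
        if (state == SC_INITIAL) {
            if (src[i] == '/' && src[i + 1] == '*') {
                state = SC_COMMENT;      /* like BEGIN(comment) */
                i += 2;
            } else {
                dst[o++] = src[i++];     /* like the default rule: echo */
            }
        } else {
            if (src[i] == '*' && src[i + 1] == '/') {
                state = SC_INITIAL;      /* like BEGIN(INITIAL) */
                i += 2;
            } else {
                if (src[i] == '\n')
                    ++line_num;          /* count lines inside comments */
                ++i;                     /* eat comment text */
            }
        }
    }
    dst[o] = '\0';
    return line_num;
}
```

The model steps one character at a time for clarity, whereas the @code{flex} rules deliberately match long runs of text per rule.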
|
|
|
|
This scanner goes to a bit of trouble to match as much
|
|
text as possible with each rule. In general, when attempting to write
|
|
a high-speed scanner try to match as much as possible in each rule, as
|
|
it's a big win.
|
|
|
|
Note that start condition names are really integer values and
|
|
can be stored as such. Thus, the above could be extended in the
|
|
following fashion:
|
|
|
|
@cindex start conditions, integer values
|
|
@cindex using integer values of start condition names
|
|
@example
|
|
@verbatim
|
|
%x comment foo
|
|
%%
|
|
        int line_num = 1;
        int comment_caller;
|
|
|
|
"/*" {
|
|
comment_caller = INITIAL;
|
|
BEGIN(comment);
|
|
}
|
|
|
|
...
|
|
|
|
<foo>"/*" {
|
|
comment_caller = foo;
|
|
BEGIN(comment);
|
|
}
|
|
|
|
<comment>[^*\n]* /* eat anything that's not a '*' */
|
|
<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
|
|
<comment>\n ++line_num;
|
|
<comment>"*"+"/" BEGIN(comment_caller);
|
|
@end verbatim
|
|
@end example
|
|
|
|
@cindex YY_START, example
|
|
Furthermore, you can access the current start condition using the
|
|
integer-valued @code{YY_START} macro. For example, the above
|
|
assignments to @code{comment_caller} could instead be written:
|
|
|
|
@cindex getting current start state with YY_START
|
|
@example
|
|
@verbatim
|
|
comment_caller = YY_START;
|
|
@end verbatim
|
|
@end example
|
|
|
|
@vindex YY_START
|
|
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
|
|
is what's used by AT&T @code{lex}).
|
|
|
|
For historical reasons, start conditions do not have their own
|
|
name-space within the generated scanner. The start condition names are
|
|
unmodified in the generated scanner and generated header.
|
|
@xref{option-header}. @xref{option-prefix}.
|
|
|
|
|
|
|
|
Finally, here's an example of how to match C-style quoted strings using
|
|
exclusive start conditions, including expanded escape sequences (but
|
|
not including checking for a string that's too long):
|
|
|
|
@cindex matching C-style double-quoted strings
|
|
@example
|
|
@verbatim
|
|
%x str
|
|
|
|
%%
|
|
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
|
|
|
|
|
|
\" string_buf_ptr = string_buf; BEGIN(str);
|
|
|
|
<str>\" { /* saw closing quote - all done */
|
|
BEGIN(INITIAL);
|
|
*string_buf_ptr = '\0';
|
|
/* return string constant token type and
|
|
* value to parser
|
|
*/
|
|
}
|
|
|
|
<str>\n {
|
|
/* error - unterminated string constant */
|
|
/* generate error message */
|
|
}
|
|
|
|
<str>\\[0-7]{1,3} {
|
|
/* octal escape sequence */
|
|
unsigned int result; /* "%o" expects an unsigned int pointer */
|
|
|
|
(void) sscanf( yytext + 1, "%o", &result );
|
|
|
|
if ( result > 0xff )
|
|
{ /* error, constant is out-of-bounds */ }
|
|
|
|
*string_buf_ptr++ = result;
|
|
}
|
|
|
|
<str>\\[0-9]+ {
|
|
/* generate error - bad escape sequence; something
|
|
* like '\48' or '\0777777'
|
|
*/
|
|
}
|
|
|
|
<str>\\n *string_buf_ptr++ = '\n';
|
|
<str>\\t *string_buf_ptr++ = '\t';
|
|
<str>\\r *string_buf_ptr++ = '\r';
|
|
<str>\\b *string_buf_ptr++ = '\b';
|
|
<str>\\f *string_buf_ptr++ = '\f';
|
|
|
|
<str>\\(.|\n) *string_buf_ptr++ = yytext[1];
|
|
|
|
<str>[^\\\n\"]+ {
|
|
char *yptr = yytext;
|
|
|
|
while ( *yptr )
|
|
*string_buf_ptr++ = *yptr++;
|
|
}
|
|
@end verbatim
|
|
@end example
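
The octal-escape action above can also be tried outside a scanner. The following is a minimal sketch, assuming a hypothetical helper named @code{decode_octal_escape} that receives the text the @code{<str>\\[0-7]@{1,3@}} rule would match; it is not part of @code{flex}.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Decode an octal escape such as "\101" (a backslash followed by one
 * to three octal digits).  Stores the byte in *out and returns 0, or
 * returns -1 when the text is malformed or the value does not fit in
 * a single byte.
 */
int decode_octal_escape(const char *esc, unsigned char *out)
{
    unsigned int result;

    assert(esc != NULL && out != NULL);

    if (esc[0] != '\\')
        return -1;

    /* "%3o" reads at most three octal digits into an unsigned int. */
    if (sscanf(esc + 1, "%3o", &result) != 1)
        return -1;

    if (result > 0xff)
        return -1;              /* e.g. "\777" is 511, out of range */

    *out = (unsigned char) result;
    return 0;
}
```

Returning an error code rather than storing the byte keeps an out-of-range constant from being appended to the string buffer.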
|
|
|
|
@cindex start condition, applying to multiple patterns
|
|
Often, such as in some of the examples above, you wind up writing a
|
|
whole bunch of rules all preceded by the same start condition(s). Flex
|
|
makes this a little easier and cleaner by introducing a notion of start
|
|
condition @dfn{scope}. A start condition scope is begun with:
|
|
|
|
@example
|
|
@verbatim
|
|
<SCs>{
|
|
@end verbatim
|
|
@end example
|
|
|
|
where @code{SCs} is a list of one or more start conditions. Inside the
|
|
start condition scope, every rule automatically has the prefix
|
|
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
|
|
@samp{@{}. So, for example,
|
|
|
|
@cindex extended scope of start conditions
|
|
@example
|
|
@verbatim
|
|
<ESC>{
|
|
"\\n" return '\n';
|
|
"\\r" return '\r';
|
|
"\\f" return '\f';
|
|
"\\0" return '\0';
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
is equivalent to:
|
|
|
|
@example
|
|
@verbatim
|
|
<ESC>"\\n" return '\n';
|
|
<ESC>"\\r" return '\r';
|
|
<ESC>"\\f" return '\f';
|
|
<ESC>"\\0" return '\0';
|
|
@end verbatim
|
|
@end example
|
|
|
|
Start condition scopes may be nested.
|
|
|
|
@cindex stacks, routines for manipulating
|
|
@cindex start conditions, use of a stack
|
|
|
|
The following routines are available for manipulating stacks of start conditions:
|
|
|
|
@deftypefun void yy_push_state ( int @code{new_state} )
|
|
pushes the current start condition onto the top of the start condition
|
|
stack and switches to
|
|
@code{new_state}
|
|
as though you had used
|
|
@code{BEGIN new_state}
|
|
(recall that start condition names are also integers).
|
|
@end deftypefun
|
|
|
|
@deftypefun void yy_pop_state ()
|
|
pops the top of the stack and switches to it via
|
|
@code{BEGIN}.
|
|
@end deftypefun
|
|
|
|
@deftypefun int yy_top_state ()
|
|
returns the top of the stack without altering the stack's contents.
|
|
@end deftypefun
|
|
|
|
@cindex memory, for start condition stacks
|
|
The start condition stack grows dynamically and so has no built-in size
|
|
limitation. If memory is exhausted, program execution aborts.
|
|
|
|
To use start condition stacks, your scanner must include a @code{%option
|
|
stack} directive (@pxref{Scanner Options}).
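
The semantics of the three stack routines can be modeled in a few lines of plain C. This is a rough sketch for illustration only and is not the code that @code{flex} generates; the names @code{sc_push}, @code{sc_pop}, @code{sc_top} and @code{sc_start} are hypothetical stand-ins for @code{yy_push_state()}, @code{yy_pop_state()}, @code{yy_top_state()} and @code{YY_START}.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical model: the current start condition plus a growable stack. */
static int sc_current = 0;        /* 0 plays the role of INITIAL */
static int *sc_stack = NULL;
static size_t sc_depth = 0, sc_cap = 0;

/* Like YY_START: report the current start condition. */
int sc_start(void)
{
    return sc_current;
}

/* Like yy_push_state(): save the current condition, then switch. */
void sc_push(int new_state)
{
    if (sc_depth == sc_cap) {
        size_t new_cap = sc_cap ? sc_cap * 2 : 8;
        int *p = realloc(sc_stack, new_cap * sizeof *p);

        if (p == NULL) {          /* memory exhausted: abort */
            fprintf(stderr, "out of memory growing start condition stack\n");
            abort();
        }
        sc_stack = p;
        sc_cap = new_cap;
    }
    sc_stack[sc_depth++] = sc_current;
    sc_current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved condition. */
void sc_pop(void)
{
    assert(sc_depth > 0);         /* popping an empty stack is an error */
    sc_current = sc_stack[--sc_depth];
}

/* Like yy_top_state(): peek at the saved condition without popping. */
int sc_top(void)
{
    assert(sc_depth > 0);
    return sc_stack[sc_depth - 1];
}
```

Note that the value on top of the stack is the condition that was active before the last push, not the condition currently in effect.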
|
|
|
|
@node Multiple Input Buffers, EOF, Start Conditions, Top
|
|
@chapter Multiple Input Buffers
|
|
|
|
@cindex multiple input streams
|
|
Some scanners (such as those which support ``include'' files) require
|
|
reading from several input streams. As @code{flex} scanners do a large
|
|
amount of buffering, one cannot control where the next input will be
|
|
read from by simply writing a @code{YY_INPUT()} which is sensitive to
|
|
the scanning context. @code{YY_INPUT()} is only called when the scanner
|
|
reaches the end of its buffer, which may be a long time after scanning a
|
|
statement such as an @code{include} statement which requires switching
|
|
the input source.
|
|
|
|
To negotiate these sorts of problems, @code{flex} provides a mechanism
|
|
for creating and switching between multiple input buffers. An input
|
|
buffer is created by using:
|
|
|
|
@cindex memory, allocating input buffers
|
|
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
|
|
@end deftypefun
|
|
|
|
which takes a @code{FILE} pointer and a size and creates a buffer
|
|
associated with the given file and large enough to hold @code{size}
|
|
characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It
|
|
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
|
|
other routines (see below).
|
|
@tindex YY_BUFFER_STATE
|
|
The @code{YY_BUFFER_STATE} type is a
|
|
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
|
|
safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
|
|
0)} if you wish, and also refer to the opaque structure in order to
|
|
correctly declare input buffers in source files other than that of your
|
|
scanner. Note that the @code{FILE} pointer in the call to
|
|
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
|
|
@code{YY_INPUT}. If you redefine @code{YY_INPUT()} so it no longer uses
|
|
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
|
|
@code{yy_create_buffer}. You select a particular buffer to scan from
|
|
using:
|
|
|
|
@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
|
|
@end deftypefun
|
|
|
|
The above function switches the scanner's input buffer so subsequent tokens
|
|
will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may
|
|
be used by @code{yywrap()} to set things up for continued scanning, instead of
|
|
opening a new file and pointing @file{yyin} at it. If you are looking for a
|
|
stack of input buffers, then you want to use @code{yypush_buffer_state()}
|
|
instead of this function. Note also that switching input sources via either
|
|
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
|
|
start condition.
|
|
|
|
@cindex memory, deleting input buffers
|
|
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
|
|
@end deftypefun
|
|
|
|
is used to reclaim the storage associated with a buffer. (@code{buffer}
|
|
can be NULL, in which case the routine does nothing.)
|
|
|
|
@cindex pushing an input buffer
|
|
@cindex stack, input buffer push
|
|
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
|
|
@end deftypefun
|
|
|
|
This function pushes the new buffer state onto an internal stack. The pushed
|
|
state becomes the new current state. The stack is maintained by flex and will
|
|
grow as required. This function is intended to be used instead of
|
|
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
|
|
current state for later use.
|
|
|
|
@cindex popping an input buffer
|
|
@cindex stack, input buffer pop
|
|
@deftypefun void yypop_buffer_state ( )
|
|
@end deftypefun
|
|
|
|
This function removes the current state from the top of the stack, and deletes
|
|
it by calling @code{yy_delete_buffer}. The next state on the stack, if any,
|
|
becomes the new current state.
|
|
|
|
@cindex clearing an input buffer
|
|
@cindex flushing an input buffer
|
|
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
|
|
@end deftypefun
|
|
|
|
This function discards the buffer's contents,
|
|
so the next time the scanner attempts to match a token from the
|
|
buffer, it will first fill the buffer anew using
|
|
@code{YY_INPUT()}.
|
|
|
|
@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
|
|
@end deftypefun
|
|
|
|
is an alias for @code{yy_create_buffer()},
|
|
provided for compatibility with the C++ use of @code{new} and
|
|
@code{delete} for creating and destroying dynamic objects.
|
|
|
|
@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the macro
@code{YY_CURRENT_BUFFER} returns a @code{YY_BUFFER_STATE} handle to the
|
|
current buffer. It should not be used as an lvalue.
|
|
|
|
@cindex EOF, example using multiple input buffers
|
|
Here are two examples of using these features for writing a scanner
|
|
which expands include files (the
|
|
@code{<<EOF>>}
|
|
feature is discussed below).
|
|
|
|
This first example uses @code{yypush_buffer_state()} and @code{yypop_buffer_state()}. Flex
|
|
maintains the stack internally.
|
|
|
|
@cindex handling include files with multiple input buffers
|
|
@example
|
|
@verbatim
|
|
/* the "incl" state is used for picking up the name
|
|
* of an include file
|
|
*/
|
|
%x incl
|
|
%%
|
|
include BEGIN(incl);
|
|
|
|
[a-z]+ ECHO;
|
|
[^a-z\n]*\n? ECHO;
|
|
|
|
<incl>[ \t]* /* eat the whitespace */
|
|
<incl>[^ \t\n]+ { /* got the include file name */
|
|
yyin = fopen( yytext, "r" );
|
|
|
|
if ( ! yyin )
|
|
error( ... );
|
|
|
|
yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));
|
|
|
|
BEGIN(INITIAL);
|
|
}
|
|
|
|
<<EOF>> {
|
|
yypop_buffer_state();
|
|
|
|
if ( !YY_CURRENT_BUFFER )
|
|
{
|
|
yyterminate();
|
|
}
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
The second example, below, does the same thing as the previous example did, but
|
|
manages its own input buffer stack manually (instead of letting flex do it).
|
|
|
|
@cindex handling include files with multiple input buffers
|
|
@example
|
|
@verbatim
|
|
/* the "incl" state is used for picking up the name
|
|
* of an include file
|
|
*/
|
|
%x incl
|
|
|
|
%{
|
|
#define MAX_INCLUDE_DEPTH 10
|
|
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
|
|
int include_stack_ptr = 0;
|
|
%}
|
|
|
|
%%
|
|
include BEGIN(incl);
|
|
|
|
[a-z]+ ECHO;
|
|
[^a-z\n]*\n? ECHO;
|
|
|
|
<incl>[ \t]* /* eat the whitespace */
|
|
<incl>[^ \t\n]+ { /* got the include file name */
|
|
if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
|
|
{
|
|
fprintf( stderr, "Includes nested too deeply\n" );
|
|
exit( 1 );
|
|
}
|
|
|
|
include_stack[include_stack_ptr++] =
|
|
YY_CURRENT_BUFFER;
|
|
|
|
yyin = fopen( yytext, "r" );
|
|
|
|
if ( ! yyin )
|
|
error( ... );
|
|
|
|
yy_switch_to_buffer(
|
|
yy_create_buffer( yyin, YY_BUF_SIZE ) );
|
|
|
|
BEGIN(INITIAL);
|
|
}
|
|
|
|
<<EOF>> {
|
|
if ( --include_stack_ptr < 0 )
|
|
{
|
|
yyterminate();
|
|
}
|
|
|
|
else
|
|
{
|
|
yy_delete_buffer( YY_CURRENT_BUFFER );
|
|
yy_switch_to_buffer(
|
|
include_stack[include_stack_ptr] );
|
|
}
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@anchor{Scanning Strings}
|
|
@cindex strings, scanning strings instead of files
|
|
The following routines are available for setting up input buffers for
|
|
scanning in-memory strings instead of files. All of them create a new
|
|
input buffer for scanning the string, and return a corresponding
|
|
@code{YY_BUFFER_STATE} handle (which you should delete with
|
|
@code{yy_delete_buffer()} when done with it). They also switch to the
|
|
new buffer using @code{yy_switch_to_buffer()}, so the next call to
|
|
@code{yylex()} will start scanning the string.
|
|
|
|
@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
|
|
scans a NUL-terminated string.
|
|
@end deftypefun
|
|
|
|
@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
|
|
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
|
|
@code{bytes}.
|
|
@end deftypefun
|
|
|
|
Note that both of these functions create and scan a @emph{copy} of the
|
|
string or bytes. (This may be desirable, since @code{yylex()} modifies
|
|
the contents of the buffer it is scanning.) You can avoid the copy by
|
|
using:
|
|
|
|
@vindex YY_END_OF_BUFFER_CHAR
|
|
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
|
|
which scans in place the buffer starting at @code{base}, consisting of
|
|
@code{size} bytes, the last two bytes of which @emph{must} be
|
|
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not
|
|
scanned; thus, scanning consists of @code{base[0]} through
|
|
@code{base[size-3]}, inclusive.
|
|
@end deftypefun
|
|
|
|
If you fail to set up @code{base} in this manner (i.e., forget the final
|
|
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
|
|
returns a NULL pointer instead of creating a new input buffer.
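
The layout that @code{yy_scan_buffer()} expects can be prepared as in the following sketch. The helper name @code{make_scan_buffer} is hypothetical and not part of @code{flex}; the code only builds the buffer and does not call into a scanner.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: copy text into a freshly allocated buffer laid
 * out the way yy_scan_buffer() requires, i.e. the data followed by two
 * NUL (YY_END_OF_BUFFER_CHAR) bytes.  On success returns the buffer and
 * stores its total size, including both sentinels, in *size.
 * Returns NULL if allocation fails.  The caller frees the buffer.
 */
char *make_scan_buffer(const char *text, size_t *size)
{
    size_t len;
    char *base;

    assert(text != NULL && size != NULL);

    len = strlen(text);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;

    memcpy(base, text, len);
    base[len] = '\0';       /* first end-of-buffer sentinel */
    base[len + 1] = '\0';   /* second end-of-buffer sentinel */
    *size = len + 2;
    return base;
}
```

Inside a scanner, the result would then be passed along the lines of @code{yy_scan_buffer( base, (yy_size_t) size )}. Since the buffer is scanned in place, it presumably has to stay allocated until the scanner is done with it.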
|
|
|
|
@deftp {Data type} yy_size_t
|
|
is an integral type to which you can cast an integer expression
|
|
reflecting the size of the buffer.
|
|
@end deftp
|
|
|
|
@node EOF, Misc Macros, Multiple Input Buffers, Top
|
|
@chapter End-of-File Rules
|
|
|
|
@cindex EOF, explanation
|
|
The special rule @code{<<EOF>>} indicates
|
|
actions which are to be taken when an end-of-file is
|
|
encountered and @code{yywrap()} returns non-zero (i.e., indicates
|
|
no further files to process). The action must finish
|
|
by doing one of the following things:
|
|
|
|
@itemize
|
|
@item
|
|
@findex YY_NEW_FILE (now obsolete)
|
|
assigning @file{yyin} to a new input file (in previous versions of
|
|
@code{flex}, after doing the assignment you had to call the special
|
|
action @code{YY_NEW_FILE}. This is no longer necessary.)
|
|
|
|
@item
|
|
executing a @code{return} statement;
|
|
|
|
@item
|
|
executing the special @code{yyterminate()} action.
|
|
|
|
@item
|
|
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
|
|
shown in the example above.
|
|
@end itemize
|
|
|
|
@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions. If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions. To specify an @code{<<EOF>>} rule for only the
|
|
initial start condition, use:
|
|
|
|
@example
|
|
@verbatim
|
|
<INITIAL><<EOF>>
|
|
@end verbatim
|
|
@end example
|
|
|
|
These rules are useful for catching things like unclosed comments. An
|
|
example:
|
|
|
|
@cindex <<EOF>>, use of
|
|
@example
|
|
@verbatim
|
|
%x quote
|
|
%%
|
|
|
|
...other rules for dealing with quotes...
|
|
|
|
<quote><<EOF>> {
|
|
error( "unterminated quote" );
|
|
yyterminate();
|
|
}
|
|
<<EOF>> {
|
|
if ( *++filelist )
|
|
yyin = fopen( *filelist, "r" );
|
|
else
|
|
yyterminate();
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Misc Macros, User Values, EOF, Top
|
|
@chapter Miscellaneous Macros
|
|
|
|
@hkindex YY_USER_ACTION
|
|
The macro @code{YY_USER_ACTION} can be defined to provide an action
|
|
which is always executed prior to the matched rule's action. For
|
|
example, it could be @code{#define}'d to call a routine to convert @code{yytext} to
|
|
lower-case. When @code{YY_USER_ACTION} is invoked, the variable
|
|
@code{yy_act} gives the number of the matched rule (rules are numbered
|
|
starting with 1). Suppose you want to profile how often each of your
|
|
rules is matched. The following would do the trick:
|
|
|
|
@cindex YY_USER_ACTION to track each time a rule is matched
|
|
@example
|
|
@verbatim
|
|
#define YY_USER_ACTION ++ctr[yy_act]
|
|
@end verbatim
|
|
@end example
|
|
|
|
@vindex YY_NUM_RULES
|
|
where @code{ctr} is an array to hold the counts for the different rules.
|
|
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
|
|
(including the default rule), even if you use @samp{-s}), so a correct
|
|
declaration for @code{ctr} is:
|
|
|
|
@example
|
|
@verbatim
|
|
int ctr[YY_NUM_RULES];
|
|
@end verbatim
|
|
@end example
|
|
|
|
@hkindex YY_USER_INIT
|
|
The macro @code{YY_USER_INIT} may be defined to provide an action which
|
|
is always executed before the first scan (and before the scanner's
|
|
internal initializations are done). For example, it could be used to
|
|
call a routine to read in a data table or open a logging file.
|
|
|
|
@findex yy_set_interactive
|
|
The macro @code{yy_set_interactive(is_interactive)} can be used to
|
|
control whether the current buffer is considered @dfn{interactive}. An
|
|
interactive buffer is processed more slowly, but must be used when the
|
|
scanner's input source is indeed interactive to avoid problems due to
|
|
waiting to fill buffers (see the discussion of the @samp{-I} flag in
|
|
@ref{Scanner Options}). A non-zero value in the macro invocation marks
|
|
the buffer as interactive, a zero value as non-interactive. Note that
|
|
use of this macro overrides @code{%option always-interactive} or
|
|
@code{%option never-interactive} (@pxref{Scanner Options}).
|
|
@code{yy_set_interactive()} must be invoked prior to beginning to scan
|
|
the buffer that is (or is not) to be considered interactive.
|
|
|
|
@cindex BOL, setting it
|
|
@findex yy_set_bol
|
|
The macro @code{yy_set_bol(at_bol)} can be used to control whether the
|
|
current buffer's scanning context for the next token match is done as
|
|
though at the beginning of a line. A non-zero macro argument makes
|
|
rules anchored with @samp{^} active, while a zero argument makes
|
|
@samp{^} rules inactive.
|
|
|
|
@cindex BOL, checking the BOL flag
|
|
@findex YY_AT_BOL
|
|
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
|
|
the current buffer will have @samp{^} rules active, false otherwise.
|
|
|
|
@cindex actions, redefining YY_BREAK
|
|
@hkindex YY_BREAK
|
|
In the generated scanner, the actions are all gathered in one large
|
|
switch statement and separated using @code{YY_BREAK}, which may be
|
|
redefined. By default, it is simply a @code{break}, to separate each
|
|
rule's action from the following rule's. Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing (while
being very careful that every rule ends with a @code{break} or a
@code{return}!) to avoid unreachable-statement warnings, which arise
when a rule's action ends with @code{return} and the @code{YY_BREAK}
that follows it can never be reached.
|
|
|
|
@node User Values, Yacc, Misc Macros, Top
|
|
@chapter Values Available To the User
|
|
|
|
This chapter summarizes the various values available to the user in the
|
|
rule actions.
|
|
|
|
@table @code
|
|
@vindex yytext
|
|
@item char *yytext
|
|
holds the text of the current token. It may be modified but not
|
|
lengthened (you cannot append characters to the end).
|
|
|
|
@cindex yytext, default array size
|
|
@cindex array, default size for yytext
|
|
@vindex YYLMAX
|
|
If the special directive @code{%array} appears in the first section of
|
|
the scanner description, then @code{yytext} is instead declared
|
|
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
|
|
that you can redefine in the first section if you don't like the default
|
|
value (generally 8KB). Using @code{%array} results in somewhat slower
|
|
scanners, but the value of @code{yytext} becomes immune to calls to
|
|
@code{unput()}, which potentially destroy its value when @code{yytext} is
|
|
a character pointer. The opposite of @code{%array} is @code{%pointer},
|
|
which is the default.
|
|
|
|
@cindex C++ and %array
|
|
You cannot use @code{%array} when generating C++ scanner classes (the
|
|
@samp{-+} flag).
|
|
|
|
@vindex yyleng
|
|
@item int yyleng
|
|
holds the length of the current token.
|
|
|
|
@vindex yyin
|
|
@item FILE *yyin
|
|
is the file which by default @code{flex} reads from. It may be
|
|
redefined but doing so only makes sense before scanning begins or after
|
|
an EOF has been encountered. Changing it in the midst of scanning will
|
|
have unexpected results since @code{flex} buffers its input; use
|
|
@code{yyrestart()} instead. Once scanning terminates because an
|
|
end-of-file has been seen, you can point @file{yyin} at the new input
|
|
file and then call the scanner again to continue scanning.
|
|
|
|
@findex yyrestart
|
|
@item void yyrestart( FILE *new_file )
|
|
may be called to point @file{yyin} at the new input file. The
|
|
switch-over to the new file is immediate (any previously buffered-up
|
|
input is lost). Note that calling @code{yyrestart()} with @file{yyin}
|
|
as an argument thus throws away the current input buffer and continues
|
|
scanning the same input file.
|
|
|
|
@vindex yyout
|
|
@item FILE *yyout
|
|
is the file to which @code{ECHO} actions are done. It can be reassigned
|
|
by the user.
|
|
|
|
@vindex YY_CURRENT_BUFFER
|
|
@item YY_CURRENT_BUFFER
|
|
returns a @code{YY_BUFFER_STATE} handle to the current buffer.
|
|
|
|
@vindex YY_START
|
|
@item YY_START
|
|
returns an integer value corresponding to the current start condition.
|
|
You can subsequently use this value with @code{BEGIN} to return to that
|
|
start condition.
|
|
@end table
|
|
|
|
@node Yacc, Scanner Options, User Values, Top
|
|
@chapter Interfacing with Yacc
|
|
|
|
@cindex yacc, interface
|
|
|
|
@vindex yylval, with yacc
|
|
One of the main uses of @code{flex} is as a companion to the @code{yacc}
|
|
parser-generator. @code{yacc} parsers expect to call a routine named
|
|
@code{yylex()} to find the next input token. The routine is supposed to
|
|
return the type of the next token as well as putting any associated
|
|
value in the global @code{yylval}. To use @code{flex} with @code{yacc},
|
|
one specifies the @samp{-d} option to @code{yacc} to instruct it to
|
|
generate the file @file{y.tab.h} containing definitions of all the
|
|
@code{%token}s appearing in the @code{yacc} input. This file is then
|
|
included in the @code{flex} scanner. For example, if one of the tokens
|
|
is @code{TOK_NUMBER}, part of the scanner might look like:
|
|
|
|
@cindex yacc interface
|
|
@example
|
|
@verbatim
|
|
%{
|
|
#include "y.tab.h"
|
|
%}
|
|
|
|
%%
|
|
|
|
[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Scanner Options, Performance, Yacc, Top
|
|
@chapter Scanner Options
|
|
|
|
@cindex command-line options
|
|
@cindex options, command-line
|
|
@cindex arguments, command-line
|
|
|
|
The various @code{flex} options are categorized by function in the following
|
|
menu. If you want to look up a particular option by name, see @ref{Index of Scanner Options}.
|
|
|
|
@menu
|
|
* Options for Specifying Filenames::
|
|
* Options Affecting Scanner Behavior::
|
|
* Code-Level And API Options::
|
|
* Options for Scanner Speed and Size::
|
|
* Debugging Options::
|
|
* Miscellaneous Options::
|
|
@end menu
|
|
|
|
Even though there are many scanner options, a typical scanner might only
|
|
specify the following options:
|
|
|
|
@example
|
|
@verbatim
|
|
%option 8bit reentrant bison-bridge
|
|
%option warn nodefault
|
|
%option yylineno
|
|
%option outfile="scanner.c" header-file="scanner.h"
|
|
@end verbatim
|
|
@end example
|
|
|
|
The first line specifies the general type of scanner we want. The second line
|
|
specifies that we are being careful. The third line asks flex to track line
|
|
numbers. The last line tells flex what to name the files. (The options can be
|
|
specified in any order. We just divided them.)
|
|
|
|
@code{flex} also provides a mechanism for controlling options within the
|
|
scanner specification itself, rather than from the flex command-line.
|
|
This is done by including @code{%option} directives in the first section
|
|
of the scanner specification. You can specify multiple options with a
|
|
single @code{%option} directive, and multiple directives in the first
|
|
section of your flex input file.
|
|
|
|
Most options are given simply as names, optionally preceded by the
|
|
word @samp{no} (with no intervening whitespace) to negate their meaning.
|
|
The names are the same as their long-option equivalents (but without the
|
|
leading @samp{--} ).
|
|
|
|
@code{flex} scans your rule actions to determine whether you use the
|
|
@code{REJECT} or @code{yymore()} features. The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use the features, either by setting them (e.g., @code{%option
reject}) to indicate the feature is indeed used, or unsetting them to
indicate it actually is not used (e.g., @code{%option noyymore}).
|
|
|
|
|
|
A number of options are available for lint purists who want to suppress
|
|
the appearance of unneeded routines in the generated scanner. Each of
|
|
the following, if unset (e.g., @code{%option nounput}), results in the
|
|
corresponding routine not appearing in the generated scanner:
|
|
|
|
@example
|
|
@verbatim
|
|
input, unput
|
|
yy_push_state, yy_pop_state, yy_top_state
|
|
yy_scan_buffer, yy_scan_bytes, yy_scan_string
|
|
|
|
yyget_extra, yyset_extra, yyget_leng, yyget_text,
|
|
yyget_lineno, yyset_lineno, yyget_in, yyset_in,
|
|
yyget_out, yyset_out, yyget_lval, yyset_lval,
|
|
yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
|
|
@end verbatim
|
|
@end example
|
|
|
|
(though @code{yy_push_state()} and friends won't appear anyway unless
|
|
you use @code{%option stack}).
|
|
|
|
@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
|
|
@section Options for Specifying Filenames
|
|
|
|
@table @samp
|
|
|
|
@anchor{option-header}
|
|
@opindex ---header-file
|
|
@opindex header-file
|
|
@item --header-file=FILE, @code{%option header-file="FILE"}
|
|
instructs flex to write a C header to @file{FILE}. This file contains
|
|
function prototypes, extern variables, and types used by the scanner.
|
|
Only the external API is exported by the header file. Many macros that
|
|
are usable from within scanner actions are not exported to the header
|
|
file. This is due to namespace problems and the goal of a clean
|
|
external API.
|
|
|
|
While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
|
|
is substituted with the appropriate prefix.
|
|
|
|
The @samp{--header-file} option is not compatible with the @samp{--c++} option,
|
|
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.
|
|
|
|
|
|
|
|
@anchor{option-outfile}
|
|
@opindex -o
|
|
@opindex ---outfile
|
|
@opindex outfile
|
|
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
|
|
directs flex to write the scanner to the file @file{FILE} instead of
|
|
@file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option,
|
|
then the scanner is written to @file{stdout} but its @code{#line}
|
|
directives (see the @samp{-L} option) refer to the file
|
|
@file{FILE}.
|
|
|
|
|
|
|
|
@anchor{option-stdout}
|
|
@opindex -t
|
|
@opindex ---stdout
|
|
@opindex stdout
|
|
@item -t, --stdout, @code{%option stdout}
|
|
instructs @code{flex} to write the scanner it generates to standard
|
|
output instead of @file{lex.yy.c}.
|
|
|
|
|
|
|
|
@opindex ---skel
|
|
@item -SFILE, --skel=FILE
|
|
overrides the default skeleton file from which
|
|
@code{flex}
|
|
constructs its scanners. You'll never need this option unless you are doing
|
|
@code{flex}
|
|
maintenance or development.
|
|
|
|
@opindex ---tables-file
|
|
@opindex tables-file
|
|
@item --tables-file=FILE
|
|
Write serialized scanner DFA tables to @file{FILE}. The generated scanner will not
|
|
contain the tables, and requires them to be loaded at runtime.
|
|
@xref{serialization}.
|
|
|
|
@opindex ---tables-verify
|
|
@opindex tables-verify
|
|
@item --tables-verify
|
|
This option is for flex development. We document it here in case you stumble
|
|
upon it by accident or in case you suspect some inconsistency in the serialized
|
|
tables. Flex will serialize the scanner DFA tables but will also generate the
|
|
in-code tables as it normally does. At runtime, the scanner will verify that
|
|
the serialized tables match the in-code tables, instead of loading them.
|
|
|
|
@end table
|
|
|
|
@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
|
|
@section Options Affecting Scanner Behavior
|
|
|
|
@table @samp
|
|
@anchor{option-case-insensitive}
|
|
@opindex -i
|
|
@opindex ---case-insensitive
|
|
@opindex case-insensitive
|
|
@item -i, --case-insensitive, @code{%option case-insensitive}
|
|
instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The
|
|
case of letters given in the @code{flex} input patterns will be ignored,
|
|
and tokens in the input will be matched regardless of case. The matched
|
|
text given in @code{yytext} will have the preserved case (i.e., it will
|
|
not be folded). For tricky behavior, see @ref{case and character ranges}.
|
|
|
|
|
|
|
|
@anchor{option-lex-compat}
|
|
@opindex -l
|
|
@opindex ---lex-compat
|
|
@opindex lex-compat
|
|
@item -l, --lex-compat, @code{%option lex-compat}
|
|
turns on maximum compatibility with the original AT&T @code{lex}
|
|
implementation. Note that this does not mean @emph{full} compatibility.
|
|
Use of this option costs a considerable amount of performance, and it
|
|
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
|
|
@samp{-CF} options. For details on the compatibilities it provides, see
|
|
@ref{Lex and Posix}. This option also results in the name
|
|
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.
|
|
|
|
|
|
|
|
@anchor{option-batch}
|
|
@opindex -B
|
|
@opindex ---batch
|
|
@opindex batch
|
|
@item -B, --batch, @code{%option batch}
|
|
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
|
|
@emph{interactive} scanners generated by @samp{--interactive} (see below). In
|
|
general, you use @samp{-B} when you are @emph{certain} that your scanner
|
|
will never be used interactively, and you want to squeeze a
|
|
@emph{little} more performance out of it. If your goal is instead to
|
|
squeeze out a @emph{lot} more performance, you should be using the
|
|
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
|
|
anyway.
|
|
|
|
|
|
|
|
@anchor{option-interactive}
|
|
@opindex -I
|
|
@opindex ---interactive
|
|
@opindex interactive
|
|
@item -I, --interactive, @code{%option interactive}
|
|
instructs @code{flex} to generate an @i{interactive} scanner. An
|
|
interactive scanner is one that only looks ahead to decide what token
|
|
has been matched if it absolutely must. It turns out that always
|
|
looking one extra character ahead, even if the scanner has already seen
|
|
enough text to disambiguate the current token, is a bit faster than only
|
|
looking ahead when necessary. But scanners that always look ahead give
|
|
dreadful interactive performance; for example, when a user types a
|
|
newline, it is not recognized as a newline token until they enter
|
|
@emph{another} token, which often means typing in another whole line.
|
|
|
|
@code{flex} scanners default to @code{interactive} unless you use the
|
|
@samp{-Cf} or @samp{-CF} table-compression options
|
|
(@pxref{Performance}). That's because if you're looking for
|
|
high-performance you should be using one of these options, so if you
|
|
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
|
|
performance for intuitive interactive behavior. Note also that you
|
|
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
|
|
@samp{-CF}. Thus, this option is not really needed; it is on by default
|
|
for all those cases in which it is allowed.
|
|
|
|
You can force a scanner to
|
|
@emph{not}
|
|
be interactive by using
|
|
@samp{--batch}.
|
|
|
|
|
|
|
|
@anchor{option-7bit}
|
|
@opindex -7
|
|
@opindex ---7bit
|
|
@opindex 7bit
|
|
@item -7, --7bit, @code{%option 7bit}
|
|
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
|
|
only recognize 7-bit characters in its input. The advantage of using
|
|
@samp{--7bit} is that the scanner's tables can be up to half the size of
|
|
those generated using @samp{--8bit}. The disadvantage is that such
|
|
scanners often hang or crash if their input contains an 8-bit character.
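
One way to guard a 7-bit scanner is to check its input before scanning.
The following C sketch only illustrates that idea and is not part of
@code{flex}; the function name @code{first_8bit_offset} is hypothetical.

@example
@verbatim
#include <stddef.h>

/* Return the offset of the first byte with the high bit set, or -1 if
   every byte in buf[0..len-1] is a 7-bit character. */
long first_8bit_offset(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; ++i)
        if (buf[i] & 0x80)
            return (long) i;
    return -1;
}
@end verbatim
@end example

A caller could reject or convert the input whenever the result is not -1,
rather than passing it to a scanner built with @samp{--7bit}.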
|
|
|
|
Note, however, that unless you generate your scanner using the
|
|
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
|
|
will save only a small amount of table space, and make your scanner
|
|
considerably less portable. @code{Flex}'s default behavior is to
|
|
generate an 8-bit scanner unless you use @samp{-Cf} or @samp{-CF},
|
|
in which case @code{flex} defaults to generating 7-bit scanners unless
|
|
your site was always configured to generate 8-bit scanners (as will
|
|
often be the case with non-USA sites). You can tell whether flex
|
|
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
|
|
the @samp{--verbose} output as described above.
|
|
|
|
Note that if you use @samp{-Cfe} or @samp{-CFe} @code{flex} still
|
|
defaults to generating an 8-bit scanner, since usually with these
|
|
compression options full 8-bit tables are not much more expensive than
|
|
7-bit tables.
|
|
|
|
|
|
|
|
@anchor{option-8bit}
|
|
@opindex -8
|
|
@opindex ---8bit
|
|
@opindex 8bit
|
|
@item -8, --8bit, @code{%option 8bit}
|
|
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
|
|
recognize 8-bit characters. This flag is only needed for scanners
|
|
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
|
|
generating an 8-bit scanner anyway.
|
|
|
|
See the discussion of
|
|
@samp{--7bit}
|
|
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
|
|
and 8-bit scanners.
|
|
|
|
|
|
|
|
@anchor{option-default}
|
|
@opindex ---default
|
|
@opindex default
|
|
@item --default, @code{%option default}
|
|
generate the default rule.
|
|
|
|
|
|
|
|
@anchor{option-always-interactive}
|
|
@opindex ---always-interactive
|
|
@opindex always-interactive
|
|
@item --always-interactive, @code{%option always-interactive}
|
|
instructs flex to generate a scanner which always considers its input
|
|
@emph{interactive}. Normally, on each new input file the scanner calls
|
|
@code{isatty()} in an attempt to determine whether the scanner's input
|
|
source is interactive and thus should be read a character at a time.
|
|
When this option is used, however, then no such call is made.
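
The check that this option suppresses can be pictured with the following
C sketch. It is an illustration under stated assumptions, not the code
that @code{flex} generates: @code{fd_is_interactive} is a hypothetical
name, and a POSIX system providing @file{unistd.h} is assumed.

@example
@verbatim
#include <unistd.h>

/* Return 1 if fd refers to a terminal, 0 otherwise (including on error). */
int fd_is_interactive(int fd)
{
    return isatty(fd) == 1;
}
@end verbatim
@end example

With @samp{--always-interactive} the scanner behaves as if such a check
always succeeded, so the input is read a character at a time even when it
comes from a file or a pipe.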
|
|
|
|
|
|
|
|
@opindex ---never-interactive
|
|
@item --never-interactive, @code{%option never-interactive}
|
|
instructs flex to generate a scanner which never considers its input
|
|
interactive. This is the opposite of @code{always-interactive}.
|
|
|
|
|
|
@anchor{option-posix}
|
|
@opindex -X
|
|
@opindex ---posix
|
|
@opindex posix
|
|
@item -X, --posix, @code{%option posix}
|
|
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
|
|
@code{lex}. Since @code{flex} was originally designed to implement the
|
|
POSIX definition of @code{lex} this generally involves very few changes
|
|
in behavior. At the current writing the known differences between
|
|
@code{flex} and the POSIX standard are:
|
|
|
|
@itemize
|
|
@item
|
|
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
|
|
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
|
|
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
|
|
that has the precedence of the repeat operator higher than concatenation
|
|
(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex}
|
|
places the precedence of the repeat operator higher than concatenation
|
|
which matches the ERE processing of other POSIX utilities. When either
|
|
@samp{--posix} or @samp{-l} is specified, @code{flex} will use the
|
|
traditional AT&T and POSIX-compliant precedence for the repeat operator
|
|
where concatenation has higher precedence than the repeat operator.
|
|
@end itemize
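
The ERE precedence that @code{flex} uses by default can be observed with
the POSIX @code{regcomp()} and @code{regexec()} functions, which also
implement EREs. The following C sketch is independent of @code{flex}; the
helper name @code{ere_matches} is hypothetical.

@example
@verbatim
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX ERE pattern, 0 if it does not,
   and -1 if the pattern cannot be compiled. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
@end verbatim
@end example

With this helper, the anchored pattern @samp{^ab@{3@}$} matches
@samp{abbb} but not @samp{ababab}, while the explicitly grouped pattern
@samp{^(ab)@{3@}$} matches @samp{ababab}. Grouping with parentheses
therefore gives the same result under either precedence.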
|
|
|
|
|
|
@anchor{option-stack}
|
|
@opindex ---stack
|
|
@opindex stack
|
|
@item --stack, @code{%option stack}
|
|
enables the use of
|
|
start condition stacks (@pxref{Start Conditions}).
|
|
|
|
|
|
|
|
@anchor{option-stdinit}
|
|
@opindex ---stdinit
|
|
@opindex stdinit
|
|
@item --stdinit, @code{%option stdinit}
|
|
if set (i.e., @b{%option stdinit}) initializes @code{yyin} and
|
|
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
|
|
@file{NULL}. Some existing @code{lex} programs depend on this behavior,
|
|
even though it is not compliant with ANSI C, which does not require
|
|
@file{stdin} and @file{stdout} to be compile-time constant. In a
|
|
reentrant scanner, however, this is not a problem since initialization
|
|
is performed in @code{yylex_init} at runtime.
|
|
|
|
|
|
|
|
@anchor{option-yylineno}
|
|
@opindex ---yylineno
|
|
@opindex yylineno
|
|
@item --yylineno, @code{%option yylineno}
|
|
directs @code{flex} to generate a scanner
|
|
that maintains the number of the current line read from its input in the
|
|
global variable @code{yylineno}. This option is implied by
@code{%option lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
value is not modified by @code{flex} unless @code{%option yylineno} is enabled.
|
|
|
|
|
|
|
|
@anchor{option-yywrap}
|
|
@opindex ---yywrap
|
|
@opindex yywrap
|
|
@item --yywrap, @code{%option yywrap}
|
|
if unset (i.e., @code{--noyywrap}), makes the scanner not call
|
|
@code{yywrap()} upon an end-of-file, but simply assume that there are no
|
|
more files to scan (until the user points @file{yyin} at a new file and
|
|
calls @code{yylex()} again).
|
|
|
|
@end table
|
|
|
|
@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
|
|
@section Code-Level And API Options
|
|
|
|
@table @samp
|
|
|
|
@anchor{option-ansi-definitions}
|
|
@opindex ---ansi-definitions
|
|
@opindex ansi-definitions
|
|
@item --ansi-definitions, @code{%option ansi-definitions}
|
|
instructs flex to generate ANSI C99 definitions for functions.
|
|
This option is enabled by default.
|
|
If @code{%option noansi-definitions} is specified, then the obsolete style
|
|
is generated.
|
|
|
|
@anchor{option-ansi-prototypes}
|
|
@opindex ---ansi-prototypes
|
|
@opindex ansi-prototypes
|
|
@item --ansi-prototypes, @code{%option ansi-prototypes}
|
|
instructs flex to generate ANSI C99 prototypes for functions.
|
|
This option is enabled by default.
|
|
If @code{noansi-prototypes} is specified, then
|
|
prototypes will have empty parameter lists.
|
|
|
|
@anchor{option-bison-bridge}
|
|
@opindex ---bison-bridge
|
|
@opindex bison-bridge
|
|
@item --bison-bridge, @code{%option bison-bridge}
|
|
instructs flex to generate a C scanner that is
|
|
meant to be called by a
|
|
@code{GNU bison}
|
|
parser. The scanner has minor API changes for
|
|
@code{bison}
|
|
compatibility. In particular, the declaration of
|
|
@code{yylex}
|
|
is modified to take an additional parameter,
|
|
@code{yylval}.
|
|
@xref{Bison Bridge}.
|
|
|
|
@anchor{option-bison-locations}
|
|
@opindex ---bison-locations
|
|
@opindex bison-locations
|
|
@item --bison-locations, @code{%option bison-locations}
|
|
instructs flex that
|
|
@code{GNU bison} @code{%locations} are being used.
|
|
This means @code{yylex} will be passed
|
|
an additional parameter, @code{yylloc}. This option
|
|
implies @code{%option bison-bridge}.
|
|
@xref{Bison Bridge}.
|
|
|
|
@anchor{option-noline}
|
|
@opindex -L
|
|
@opindex ---noline
|
|
@opindex noline
|
|
@item -L, --noline, @code{%option noline}
|
|
instructs
|
|
@code{flex}
|
|
not to generate
|
|
@code{#line}
|
|
directives. Without this option,
|
|
@code{flex}
|
|
peppers the generated scanner
|
|
with @code{#line} directives so error messages in the actions will be correctly
|
|
located with respect to either the original
|
|
@code{flex}
|
|
input file (if the errors are due to code in the input file), or
|
|
@file{lex.yy.c}
|
|
(if the errors are
|
|
@code{flex}'s
|
|
fault -- you should report these sorts of errors to the email address
|
|
given in @ref{Reporting Bugs}).
|
|
|
|
|
|
|
|
@anchor{option-reentrant}
|
|
@opindex -R
|
|
@opindex ---reentrant
|
|
@opindex reentrant
|
|
@item -R, --reentrant, @code{%option reentrant}
|
|
instructs flex to generate a reentrant C scanner. The generated scanner
|
|
may safely be used in a multi-threaded environment. The API for a
|
|
reentrant scanner is different than for a non-reentrant scanner
|
|
(@pxref{Reentrant}). Because of the API difference between
|
|
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
|
|
code must be modified before it is suitable for use with this option.
|
|
This option is not compatible with the @samp{--c++} option.
|
|
|
|
The option @samp{--reentrant} does not affect the performance of
|
|
the scanner.
|
|
|
|
|
|
|
|
@anchor{option-c++}
|
|
@opindex -+
|
|
@opindex ---c++
|
|
@opindex c++
|
|
@item -+, --c++, @code{%option c++}
|
|
specifies that you want flex to generate a C++
|
|
scanner class. @xref{Cxx}, for
|
|
details.
|
|
|
|
|
|
|
|
@anchor{option-array}
|
|
@opindex ---array
|
|
@opindex array
|
|
@item --array, @code{%option array}
|
|
specifies that you want @code{yytext} to be an array instead of a @code{char *}.
|
|
|
|
|
|
|
|
@anchor{option-pointer}
|
|
@opindex ---pointer
|
|
@opindex pointer
|
|
@item --pointer, @code{%option pointer}
|
|
specify that @code{yytext} should be a @code{char *}, not an array.
|
|
The default is @code{char *}.
|
|
|
|
|
|
|
|
@anchor{option-prefix}
|
|
@opindex -P
|
|
@opindex ---prefix
|
|
@opindex prefix
|
|
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
|
|
changes the default @samp{yy} prefix used by @code{flex} for all
|
|
globally-visible variable and function names to instead be
|
|
@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of
|
|
@code{yytext} to @code{footext}. It also changes the name of the default
|
|
output file from @file{lex.yy.c} to @file{lex.foo.c}. Here is a partial
|
|
list of the names affected:
|
|
|
|
@example
|
|
@verbatim
|
|
yy_create_buffer
|
|
yy_delete_buffer
|
|
yy_flex_debug
|
|
yy_init_buffer
|
|
yy_flush_buffer
|
|
yy_load_buffer_state
|
|
yy_switch_to_buffer
|
|
yyin
|
|
yyleng
|
|
yylex
|
|
yylineno
|
|
yyout
|
|
yyrestart
|
|
yytext
|
|
yywrap
|
|
yyalloc
|
|
yyrealloc
|
|
yyfree
|
|
@end verbatim
|
|
@end example
|
|
|
|
(If you are using a C++ scanner, then only @code{yywrap} and
|
|
@code{yyFlexLexer} are affected.) Within your scanner itself, you can
|
|
still refer to the global variables and functions using either version
|
|
of their name; but externally, they have the modified name.
|
|
|
|
This option lets you easily link together multiple
|
|
@code{flex}
|
|
programs into the same executable. Note, though, that using this
|
|
option also renames
|
|
@code{yywrap()},
|
|
so you now
|
|
@emph{must}
|
|
either
|
|
provide your own (appropriately-named) version of the routine for your
|
|
scanner, or use
|
|
@code{%option noyywrap},
|
|
as linking with
|
|
@samp{-lfl}
|
|
no longer provides one for you by default.
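
For example, with @samp{--prefix=foo} a minimal replacement for the
library routine could look like the following C sketch. It assumes a
non-reentrant C scanner whose input consists of a single file.

@example
@verbatim
/* Called by the scanner at end-of-file.  Returning 1 tells it that
   there is no further input; returning 0 would tell it that more
   input has been set up and scanning should continue. */
int foowrap(void)
{
    return 1;
}
@end verbatim
@end example

Defining this function in one of your own source files satisfies the
reference that the renamed scanner makes to @code{foowrap()}.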
|
|
|
|
|
|
|
|
@anchor{option-main}
|
|
@opindex ---main
|
|
@opindex main
|
|
@item --main, @code{%option main}
|
|
directs flex to provide a default @code{main()} program for the
|
|
scanner, which simply calls @code{yylex()}. This option implies
|
|
@code{noyywrap} (see above).
|
|
|
|
|
|
|
|
@anchor{option-nounistd}
|
|
@opindex ---nounistd
|
|
@opindex nounistd
|
|
@item --nounistd, @code{%option nounistd}
|
|
suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
|
|
is meant to target environments in which @file{unistd.h} does not exist. Be aware
|
|
that certain options may cause flex to generate code that relies on functions
|
|
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
|
|
If you wish to use these functions, you will have to inform your compiler where
|
|
to find them.
|
|
@xref{option-always-interactive}. @xref{option-read}.
|
|
|
|
|
|
|
|
@anchor{option-yyclass}
|
|
@opindex ---yyclass
|
|
@opindex yyclass
|
|
@item --yyclass=NAME, @code{%option yyclass="NAME"}
|
|
only applies when generating a C++ scanner (the @samp{--c++} option). It
|
|
informs @code{flex} that you have derived @code{NAME} as a subclass of
|
|
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
|
|
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It
|
|
also generates a @code{yyFlexLexer::yylex()} member function that emits
|
|
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
|
|
called. @xref{Cxx}.
|
|
|
|
@end table
|
|
|
|
@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
|
|
@section Options for Scanner Speed and Size
|
|
|
|
@table @samp
|
|
|
|
@item -C[aefFmr]
|
|
controls the degree of table compression and, more generally, trade-offs
|
|
between small scanners and fast scanners.
|
|
|
|
@table @samp
|
|
@opindex -C
|
|
@item -C
|
|
A lone @samp{-C} specifies that the scanner tables should be compressed
|
|
but neither equivalence classes nor meta-equivalence classes should be
|
|
used.
|
|
|
|
@anchor{option-align}
|
|
@opindex -Ca
|
|
@opindex ---align
|
|
@opindex align
|
|
@item -Ca, --align, @code{%option align}
|
|
(``align'') instructs flex to trade off larger tables in the
|
|
generated scanner for faster performance because the elements of
|
|
the tables are better aligned for memory access and computation. On some
|
|
RISC architectures, fetching and manipulating longwords is more efficient
|
|
than with smaller-sized units such as shortwords. This option can
|
|
quadruple the size of the tables used by your scanner.
|
|
|
|
@anchor{option-ecs}
|
|
@opindex -Ce
|
|
@opindex ---ecs
|
|
@opindex ecs
|
|
@item -Ce, --ecs, @code{%option ecs}
|
|
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
|
|
of characters which have identical lexical properties (for example, if
|
|
the only appearance of digits in the @code{flex} input is in the
|
|
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
|
|
put in the same equivalence class). Equivalence classes usually give
|
|
dramatic reductions in the final table/object file sizes (typically a
|
|
factor of 2-5) and are pretty cheap performance-wise (one array look-up
|
|
per character scanned).
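
The idea can be sketched in C as follows. This is an illustration only
and not the algorithm that @code{flex} itself uses: it assumes the
character classes of the input are given as membership tables, and the
names @code{char_set}, @code{set_range} and @code{build_ecs} are
hypothetical. Two characters land in the same class when no set
distinguishes them.

@example
@verbatim
#include <string.h>

#define NCHARS 256

/* A character class as a membership table: in_set[c] != 0 if c belongs. */
struct char_set { unsigned char in_set[NCHARS]; };

/* Make set contain exactly the inclusive range lo..hi. */
void set_range(struct char_set *set, int lo, int hi)
{
    int c;

    memset(set->in_set, 0, sizeof set->in_set);
    for (c = lo; c <= hi; ++c)
        set->in_set[c] = 1;
}

/* Two characters are equivalent when every set treats them alike. */
static int same_class(const struct char_set *sets, int nsets, int a, int b)
{
    int i;

    for (i = 0; i < nsets; ++i)
        if (sets[i].in_set[a] != sets[i].in_set[b])
            return 0;
    return 1;
}

/* Store a class number 1..n for every character in ec; return n. */
int build_ecs(const struct char_set *sets, int nsets, int ec[NCHARS])
{
    int rep[NCHARS];    /* one representative character per class */
    int nclasses = 0, c, k;

    for (c = 0; c < NCHARS; ++c) {
        for (k = 0; k < nclasses; ++k)
            if (same_class(sets, nsets, c, rep[k]))
                break;
        if (k == nclasses)
            rep[nclasses++] = c;
        ec[c] = k + 1;
    }
    return nclasses;
}
@end verbatim
@end example

For the two sets @samp{[0-9]} and @samp{[a-z]}, @code{build_ecs} reports
three classes (digits, lower-case letters, and everything else), so a
transition table indexed by class needs 3 columns instead of 256. The
price is the extra look-up in @code{ec} for each character scanned.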
|
|
|
|
@opindex -Cf
|
|
@item -Cf
|
|
specifies that the @dfn{full} scanner tables should be generated -
|
|
@code{flex} should not compress the tables by taking advantage of
|
|
similar transition functions for different states.
|
|
|
|
@opindex -CF
|
|
@item -CF
|
|
specifies that the alternate fast scanner representation (described
|
|
above under the @samp{--fast} flag) should be used. This option cannot be
|
|
used with @samp{--c++}.
|
|
|
|
@anchor{option-meta-ecs}
|
|
@opindex -Cm
|
|
@opindex ---meta-ecs
|
|
@opindex meta-ecs
|
|
@item -Cm, --meta-ecs, @code{%option meta-ecs}
|
|
directs
|
|
@code{flex}
|
|
to construct
|
|
@dfn{meta-equivalence classes},
|
|
which are sets of equivalence classes (or characters, if equivalence
|
|
classes are not being used) that are commonly used together. Meta-equivalence
|
|
classes are often a big win when using compressed tables, but they
|
|
have a moderate performance impact (one or two @code{if} tests and one
|
|
array look-up per character scanned).
|
|
|
|
@anchor{option-read}
|
|
@opindex -Cr
|
|
@opindex ---read
|
|
@opindex read
|
|
@item -Cr, --read, @code{%option read}
|
|
causes the generated scanner to @emph{bypass} use of the standard I/O
|
|
library (@code{stdio}) for input. Instead of calling @code{fread()} or
|
|
@code{getc()}, the scanner will use the @code{read()} system call,
|
|
resulting in a performance gain which varies from system to system, but
|
|
in general is probably negligible unless you are also using @samp{-Cf}
|
|
or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for
|
|
example, you read from @file{yyin} using @code{stdio} prior to calling
|
|
the scanner (because the scanner will miss whatever text your previous
|
|
reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect
|
|
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
|
|
@end table
|
|
|
|
The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
|
|
together - there is no opportunity for meta-equivalence classes if the
|
|
table is not being compressed. Otherwise the options may be freely
|
|
mixed, and are cumulative.
|
|
|
|
The default setting is @samp{-Cem}, which specifies that @code{flex}
|
|
should generate equivalence classes and meta-equivalence classes. This
|
|
setting provides the highest degree of table compression. You can trade
|
|
off faster-executing scanners at the cost of larger tables with the
|
|
following generally being true:
|
|
|
|
@example
|
|
@verbatim
|
|
slowest & smallest
|
|
-Cem
|
|
-Cm
|
|
-Ce
|
|
-C
|
|
-C{f,F}e
|
|
-C{f,F}
|
|
-C{f,F}a
|
|
fastest & largest
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note that scanners with the smallest tables are usually generated and
|
|
compiled the quickest, so during development you will usually want to
|
|
use the default, maximal compression.
|
|
|
|
@samp{-Cfe} is often a good compromise between speed and size for
|
|
production scanners.
|
|
|
|
@anchor{option-full}
|
|
@opindex -f
|
|
@opindex ---full
|
|
@opindex full
|
|
@item -f, --full, @code{%option full}
|
|
specifies a
|
|
@dfn{fast scanner}.
|
|
No table compression is done and @code{stdio} is bypassed.
|
|
The result is large but fast. This option is equivalent to
|
|
@samp{-Cfr}.
|
|
|
|
|
|
@anchor{option-fast}
|
|
@opindex -F
|
|
@opindex ---fast
|
|
@opindex fast
|
|
@item -F, --fast, @code{%option fast}
|
|
specifies that the @emph{fast} scanner table representation should be
|
|
used (and @code{stdio} bypassed). This representation is about as fast
|
|
as the full table representation (@samp{--full}), and for some sets of
|
|
patterns will be considerably smaller (and for others, larger). In
|
|
general, if the pattern set contains both @emph{keywords} and a
|
|
catch-all, @emph{identifier} rule, such as in the set:
|
|
|
|
@example
|
|
@verbatim
|
|
"case" return TOK_CASE;
|
|
"switch" return TOK_SWITCH;
|
|
...
|
|
"default" return TOK_DEFAULT;
|
|
[a-z]+ return TOK_ID;
|
|
@end verbatim
|
|
@end example
|
|
|
|
then you're better off using the full table representation. If only
|
|
the @emph{identifier} rule is present and you then use a hash table or some such
|
|
to detect the keywords, you're better off using
|
|
@samp{--fast}.
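
The second approach can be sketched in C as follows. This is only one
possible arrangement: a sorted table searched with @code{bsearch()}
stands in for the hash table, the function name @code{classify_word} is
hypothetical, and the numeric token values are made up for the example.

@example
@verbatim
#include <stdlib.h>
#include <string.h>

/* Token codes: the names follow the rules above, the values are made up. */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword { const char *name; int token; };

/* Must stay sorted by name for bsearch(). */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *) key,
                  ((const struct keyword *) elem)->name);
}

/* Map the text matched by the identifier rule to a token code. */
int classify_word(const char *text)
{
    const struct keyword *kw = bsearch(text, keywords,
        sizeof keywords / sizeof keywords[0], sizeof keywords[0],
        compare_keyword);

    return kw ? kw->token : TOK_ID;
}
@end verbatim
@end example

The scanner then needs only the @samp{[a-z]+} rule, whose action would
return @code{classify_word(yytext)}.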
|
|
|
|
This option is equivalent to @samp{-CFr}. It cannot be used
|
|
with @samp{--c++}.
|
|
|
|
@end table
|
|
|
|
@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
|
|
@section Debugging Options
|
|
|
|
@table @samp
|
|
|
|
@anchor{option-backup}
|
|
@opindex -b
|
|
@opindex ---backup
|
|
@opindex backup
|
|
@item -b, --backup, @code{%option backup}
|
|
Generate backing-up information to @file{lex.backup}. This is a list of
|
|
scanner states which require backing up and the input characters on
|
|
which they do so. By adding rules one can remove backing-up states. If
|
|
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
|
|
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
|
|
Only users who wish to squeeze every last cycle out of their scanners
|
|
need worry about this option (@pxref{Performance}).
|
|
|
|
|
|
|
|
@anchor{option-debug}
|
|
@opindex -d
|
|
@opindex ---debug
|
|
@opindex debug
|
|
@item -d, --debug, @code{%option debug}
|
|
makes the generated scanner run in @dfn{debug} mode. Whenever a pattern
|
|
is recognized and the global variable @code{yy_flex_debug} is non-zero
|
|
(which is the default), the scanner will write to @file{stderr} a line
|
|
of the form:
|
|
|
|
@example
|
|
@verbatim
|
|
--accepting rule at line 53 ("the matched text")
|
|
@end verbatim
|
|
@end example
|
|
|
|
The line number refers to the location of the rule in the file defining
|
|
the scanner (i.e., the file that was fed to flex). Messages are also
|
|
generated when the scanner backs up, accepts the default rule, reaches
|
|
the end of its input buffer (or encounters a NUL; at this point, the two
|
|
look the same as far as the scanner's concerned), or reaches an
|
|
end-of-file.
|
|
|
|
|
|
|
|
@anchor{option-perf-report}
|
|
@opindex -p
|
|
@opindex ---perf-report
|
|
@opindex perf-report
|
|
@item -p, --perf-report, @code{%option perf-report}
|
|
generates a performance report to @file{stderr}. The report consists of
|
|
comments regarding features of the @code{flex} input file which will
|
|
cause a serious loss of performance in the resulting scanner. If you
|
|
give the flag twice, you will also get comments regarding features that
|
|
lead to minor performance losses.
|
|
|
|
Note that the use of @code{REJECT}, and
|
|
variable trailing context (@pxref{Limitations}) entails a substantial
|
|
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
|
|
the @samp{--interactive} flag entail minor performance penalties.
|
|
|
|
|
|
|
|
@anchor{option-nodefault}
|
|
@opindex -s
|
|
@opindex ---nodefault
|
|
@opindex nodefault
|
|
@item -s, --nodefault, @code{%option nodefault}
|
|
causes the @emph{default rule} (that unmatched scanner input is echoed
|
|
to @file{stdout}) to be suppressed. If the scanner encounters input
|
|
that does not match any of its rules, it aborts with an error. This
|
|
option is useful for finding holes in a scanner's rule set.
|
|
|
|
|
|
|
|
@anchor{option-trace}
|
|
@opindex -T
|
|
@opindex ---trace
|
|
@opindex trace
|
|
@item -T, --trace, @code{%option trace}
|
|
makes @code{flex} run in @dfn{trace} mode. It will generate a lot of
|
|
messages to @file{stderr} concerning the form of the input and the
|
|
resultant non-deterministic and deterministic finite automata. This
|
|
option is mostly for use in maintaining @code{flex}.
|
|
|
|
|
|
|
|
@anchor{option-nowarn}
|
|
@opindex -w
|
|
@opindex ---nowarn
|
|
@opindex nowarn
|
|
@item -w, --nowarn, @code{%option nowarn}
|
|
suppresses warning messages.
|
|
|
|
|
|
|
|
@anchor{option-verbose}
|
|
@opindex -v
|
|
@opindex ---verbose
|
|
@opindex verbose
|
|
@item -v, --verbose, @code{%option verbose}
|
|
specifies that @code{flex} should write to @file{stderr} a summary of
|
|
statistics regarding the scanner it generates. Most of the statistics
|
|
are meaningless to the casual @code{flex} user, but the first line
|
|
identifies the version of @code{flex} (same as reported by @samp{--version}),
|
|
and the next line the flags used when generating the scanner, including
|
|
those that are on by default.
|
|
|
|
|
|
|
|
@anchor{option-warn}
|
|
@opindex ---warn
|
|
@opindex warn
|
|
@item --warn, @code{%option warn}
|
|
warn about certain things. In particular, if the default rule can be
|
|
matched but no default rule has been given, @code{flex} will warn you.
|
|
We recommend using this option always.
|
|
|
|
@end table
|
|
|
|
@node Miscellaneous Options, , Debugging Options, Scanner Options
|
|
@section Miscellaneous Options
|
|
|
|
@table @samp
|
|
@opindex -c
|
|
@item -c
|
|
A do-nothing option included for POSIX compliance.
|
|
|
|
@opindex -h
|
|
@opindex ---help
|
|
@item -h, -?, --help
|
|
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
|
|
and then exits.
|
|
|
|
@opindex -n
|
|
@item -n
|
|
Another do-nothing option included for
|
|
POSIX compliance.
|
|
|
|
@opindex -V
|
|
@opindex ---version
|
|
@item -V, --version
|
|
prints the version number to @file{stdout} and exits.
|
|
|
|
@end table
|
|
|
|
|
|
@node Performance, Cxx, Scanner Options, Top
|
|
@chapter Performance Considerations
|
|
|
|
@cindex performance, considerations
|
|
The main design goal of @code{flex} is that it generate high-performance
|
|
scanners. It has been optimized for dealing well with large sets of
|
|
rules. Aside from the effects on scanner speed of the table compression
|
|
@samp{-C} options outlined above, there are a number of options/actions
|
|
which degrade performance. These are, from most expensive to least:
|
|
|
|
@cindex REJECT, performance costs
|
|
@cindex yylineno, performance costs
|
|
@cindex trailing context, performance costs
|
|
@example
|
|
@verbatim
|
|
REJECT
|
|
arbitrary trailing context
|
|
|
|
pattern sets that require backing up
|
|
%option yylineno
|
|
%array
|
|
|
|
%option interactive
|
|
%option always-interactive
|
|
|
|
^ beginning-of-line operator
|
|
yymore()
|
|
@end verbatim
|
|
@end example
|
|
|
|
with the first two being quite expensive and the last two being
|
|
quite cheap. Note also that @code{unput()} is implemented as a routine
|
|
call that potentially does quite a bit of work, while @code{yyless()} is
|
|
a quite-cheap macro. So if you are just putting back some excess text
|
|
you scanned, use @code{yyless()}.
|
|
|
|
@code{REJECT} should be avoided at all costs when performance is
|
|
important. It is a particularly expensive option.
|
|
|
|
There is one case when @code{%option yylineno} can be expensive. That is when
|
|
your patterns match long tokens that could @emph{possibly} contain a newline
|
|
character. There is no performance penalty for rules that cannot possibly
|
|
match newlines, since flex does not need to check them for newlines. In
|
|
general, you should avoid rules such as @code{[^f]+}, which match very long
|
|
tokens, including newlines, and may possibly match your entire file! A better
|
|
approach is to separate @code{[^f]+} into two rules:
|
|
|
|
@example
|
|
@verbatim
|
|
%option yylineno
|
|
%%
|
|
[^f\n]+
|
|
\n+
|
|
@end verbatim
|
|
@end example
|
|
|
|
The above scanner does not incur a performance penalty.
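
The bookkeeping behind this can be pictured with the following C sketch.
It only illustrates the idea; the names @code{update_lineno} and
@code{rule_can_match_eol} are hypothetical, and the code in a generated
scanner differs.

@example
@verbatim
#include <stddef.h>

/* Return the line number after a token has been matched.  The token
   text is inspected only when its rule could match a newline at all. */
int update_lineno(int lineno, const char *text, size_t len,
                  int rule_can_match_eol)
{
    size_t i;

    if (!rule_can_match_eol)
        return lineno;          /* no scan needed for this rule */
    for (i = 0; i < len; ++i)
        if (text[i] == '\n')
            ++lineno;
    return lineno;
}
@end verbatim
@end example

With the two rules above, tokens matched by @samp{[^f\n]+} are never
rescanned, and only the short tokens matched by @samp{\n+} are searched
for newlines.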
|
|
|
|
@cindex patterns, tuning for performance
|
|
@cindex performance, backing up
|
|
@cindex backing up, example of eliminating
|
|
Getting rid of backing up is messy and often may be an enormous amount
|
|
of work for a complicated scanner. In principle, one begins by using
|
|
the @samp{-b} flag to generate a @file{lex.backup} file. For example,
|
|
on the input:
|
|
|
|
@cindex backing up, eliminating
|
|
@example
|
|
@verbatim
|
|
%%
|
|
foo return TOK_KEYWORD;
|
|
foobar return TOK_KEYWORD;
|
|
@end verbatim
|
|
@end example
|
|
|
|
the file looks like:
|
|
|
|
@example
|
|
@verbatim
|
|
State #6 is non-accepting -
|
|
associated rule line numbers:
|
|
2 3
|
|
out-transitions: [ o ]
|
|
jam-transitions: EOF [ \001-n p-\177 ]
|
|
|
|
State #8 is non-accepting -
|
|
associated rule line numbers:
|
|
3
|
|
out-transitions: [ a ]
|
|
jam-transitions: EOF [ \001-` b-\177 ]
|
|
|
|
State #9 is non-accepting -
|
|
associated rule line numbers:
|
|
3
|
|
out-transitions: [ r ]
|
|
jam-transitions: EOF [ \001-q s-\177 ]
|
|
|
|
Compressed tables always back up.
|
|
@end verbatim
|
|
@end example
|
|
|
|
The first few lines tell us that there's a scanner state in which it can
|
|
make a transition on an 'o' but not on any other character, and that in
|
|
that state the currently scanned text does not match any rule. The
|
|
state occurs when trying to match the rules found at lines 2 and 3 in
|
|
the input file. If the scanner is in that state and then reads
|
|
something other than an 'o', it will have to back up to find a rule
|
|
which is matched. With a bit of headscratching one can see that this
|
|
must be the state it's in when it has seen @samp{fo}. When this has
|
|
happened, if anything other than another @samp{o} is seen, the scanner
|
|
will have to back up to simply match the @samp{f} (by the default rule).
|
|
|
|
The comment regarding State #8 indicates there's a problem when
|
|
@samp{foob} has been scanned. Indeed, on any character other than an
|
|
@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly,
|
|
the comment for State #9 concerns when @samp{fooba} has been scanned and
|
|
an @samp{r} does not follow.
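
The three states can be reproduced with a small hand-written matcher.
The following C sketch is not generated by @code{flex}; it hard-codes the
two rules @samp{foo} and @samp{foobar} plus the one-character default
rule, and the name @code{match_token} is hypothetical.

@example
@verbatim
#include <stddef.h>

/* Match "foo" or "foobar" at the start of s, falling back to the
   one-character default rule.  Returns the token length; *backed_up
   receives the number of characters that were matched along the
   "foobar" path and then had to be given back. */
size_t match_token(const char *s, size_t *backed_up)
{
    static const char longest[] = "foobar";
    size_t scanned = 0;     /* characters consumed along the path */
    size_t accepted = 0;    /* length at the last accepting state */

    while (longest[scanned] != '\0' && s[scanned] == longest[scanned]) {
        ++scanned;
        if (scanned == 3 || scanned == 6)   /* "foo" and "foobar" accept */
            accepted = scanned;
    }
    if (accepted == 0)
        accepted = (s[0] != '\0');          /* default rule: one character */
    *backed_up = scanned > accepted ? scanned - accepted : 0;
    return accepted;
}
@end verbatim
@end example

For the input @samp{fox} the function returns 1 with one character
backed up (the situation of State #6), and for @samp{foobaz} it returns 3
with two characters backed up (State #9). For @samp{foobar} nothing is
given back.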
|
|
|
|
The final comment reminds us that there's no point going to all the
|
|
trouble of removing backing up from the rules unless we're using
|
|
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
|
|
with compressed scanners.
|
|
|
|
@cindex error rules, to eliminate backing up
|
|
The way to remove the backing up is to add ``error'' rules:
|
|
|
|
@cindex backing up, eliminating by adding error rules
|
|
@example
|
|
@verbatim
|
|
%%
|
|
foo return TOK_KEYWORD;
|
|
foobar return TOK_KEYWORD;
|
|
|
|
fooba |
|
|
foob |
|
|
fo {
|
|
/* false alarm, not really a keyword */
|
|
return TOK_ID;
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
Eliminating backing up among a list of keywords can also be done using a
|
|
``catch-all'' rule:
|
|
|
|
@cindex backing up, eliminating with catch-all rule
|
|
@example
|
|
@verbatim
|
|
%%
|
|
foo return TOK_KEYWORD;
|
|
foobar return TOK_KEYWORD;
|
|
|
|
[a-z]+ return TOK_ID;
|
|
@end verbatim
|
|
@end example
|
|
|
|
This is usually the best solution when appropriate.
|
|
|
|
Backing up messages tend to cascade. With a complicated set of rules
|
|
it's not uncommon to get hundreds of messages. If one can decipher
|
|
them, though, it often only takes a dozen or so rules to eliminate the
|
|
backing up (though it's easy to make a mistake and have an error rule
accidentally match a valid token). A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.
|
|
|
|
It's important to keep in mind that you gain the benefits of eliminating
|
|
backing up only if you eliminate @emph{every} instance of backing up.
|
|
Leaving just one means you gain nothing.
|
|
|
|
@emph{Variable} trailing context (where both the leading and trailing
|
|
parts do not have a fixed length) entails almost the same performance
|
|
loss as @code{REJECT} (i.e., substantial). So when possible a rule
|
|
like:
|
|
|
|
@cindex trailing context, variable length
|
|
@example
|
|
@verbatim
|
|
%%
|
|
mouse|rat/(cat|dog) run();
|
|
@end verbatim
|
|
@end example
|
|
|
|
is better written:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
mouse/cat|dog run();
|
|
rat/cat|dog run();
|
|
@end verbatim
|
|
@end example
|
|
|
|
or as
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
mouse|rat/cat run();
|
|
mouse|rat/dog run();
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note that here the special '|' action does @emph{not} provide any
|
|
savings, and can even make things worse (@pxref{Limitations}).
|
|
|
|
Another area where the user can increase a scanner's performance (and
|
|
one that's easier to implement) arises from the fact that the longer the
|
|
tokens matched, the faster the scanner will run. This is because with
|
|
long tokens the processing of most input characters takes place in the
|
|
(short) inner scanning loop, and does not often have to go through the
|
|
additional work of setting up the scanning environment (e.g.,
|
|
@code{yytext}) for the action. Recall the scanner for C comments:
|
|
|
|
@cindex performance optimization, matching longer tokens
|
|
@example
|
|
@verbatim
|
|
%x comment
|
|
%%
|
|
        int line_num = 1;
|
|
|
|
"/*" BEGIN(comment);
|
|
|
|
<comment>[^*\n]*
|
|
<comment>"*"+[^*/\n]*
|
|
<comment>\n ++line_num;
|
|
<comment>"*"+"/" BEGIN(INITIAL);
|
|
@end verbatim
|
|
@end example
|
|
|
|
This could be sped up by writing it as:
|
|
|
|
@example
|
|
@verbatim
|
|
%x comment
|
|
%%
|
|
        int line_num = 1;
|
|
|
|
"/*" BEGIN(comment);
|
|
|
|
<comment>[^*\n]*
|
|
<comment>[^*\n]*\n ++line_num;
|
|
<comment>"*"+[^*/\n]*
|
|
<comment>"*"+[^*/\n]*\n ++line_num;
|
|
<comment>"*"+"/" BEGIN(INITIAL);
|
|
@end verbatim
|
|
@end example
|
|
|
|
Now instead of each newline requiring the processing of another action,
|
|
recognizing the newlines is distributed over the other rules to keep the
|
|
matched text as long as possible. Note that @emph{adding} rules does
|
|
@emph{not} slow down the scanner! The speed of the scanner is
|
|
independent of the number of rules or (modulo the considerations given
|
|
at the beginning of this section) how complicated the rules are with
|
|
regard to operators such as @samp{*} and @samp{|}.
|
|
|
|
@cindex keywords, for performance
|
|
@cindex performance, using keywords
|
|
A final example in speeding up a scanner: suppose you want to scan
|
|
through a file containing identifiers and keywords, one per line
|
|
and with no other extraneous characters, and recognize all the
|
|
keywords. A natural first approach is:
|
|
|
|
@cindex performance optimization, recognizing keywords
|
|
@example
|
|
@verbatim
|
|
%%
|
|
asm |
|
|
auto |
|
|
break |
|
|
... etc ...
|
|
volatile |
|
|
while /* it's a keyword */
|
|
|
|
.|\n /* it's not a keyword */
|
|
@end verbatim
|
|
@end example
|
|
|
|
To eliminate the backing up, introduce a catch-all rule:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
asm |
|
|
auto |
|
|
break |
|
|
... etc ...
|
|
volatile |
|
|
while /* it's a keyword */
|
|
|
|
[a-z]+ |
|
|
.|\n /* it's not a keyword */
|
|
@end verbatim
|
|
@end example
|
|
|
|
Now, if it's guaranteed that there's exactly one word per line, then we
|
|
can reduce the total number of matches by a half by merging in the
|
|
recognition of newlines with that of the other tokens:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
asm\n |
|
|
auto\n |
|
|
break\n |
|
|
... etc ...
|
|
volatile\n |
|
|
while\n /* it's a keyword */
|
|
|
|
[a-z]+\n |
|
|
.|\n /* it's not a keyword */
|
|
@end verbatim
|
|
@end example
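
The saving can be checked with a small counting sketch in plain C. It
does not use @code{flex}; the two functions below are hypothetical
stand-ins that count how many rule matches each rule set would need,
under the stated assumption that the input consists solely of words
followed by newlines, one word per line:

@example
@verbatim
#include <stddef.h>

/* Matches needed when words and newlines are separate tokens. */
static size_t matches_separate(const char *s)
{
    size_t n = 0;

    while (*s) {
        if (*s == '\n') {
            ++s;                      /* the newline is its own match */
        } else {
            while (*s && *s != '\n')  /* one match for the whole word */
                ++s;
        }
        ++n;
    }
    return n;
}

/* Matches needed when each rule also consumes the trailing newline. */
static size_t matches_merged(const char *s)
{
    size_t n = 0;

    while (*s) {
        while (*s && *s != '\n')      /* the word ... */
            ++s;
        if (*s == '\n')               /* ... and its newline, together */
            ++s;
        ++n;
    }
    return n;
}
@end verbatim
@end example

For three lines of input, @code{matches_separate} counts six matches and
@code{matches_merged} counts three, which is the halving described
above.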
|
|
|
|
One has to be careful here, as we have now reintroduced backing up
|
|
into the scanner. In particular, while
|
|
@emph{we}
|
|
know that there will never be any characters in the input stream
|
|
other than letters or newlines,
|
|
@code{flex}
|
|
can't figure this out, and it will plan for possibly needing to back up
|
|
when it has scanned a token like @samp{auto} and then the next character
|
|
is something other than a newline or a letter. Previously it would
|
|
then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
|
|
rule, only a @samp{auto\n} rule. To eliminate the possibility of backing up,
|
|
we could either duplicate all rules but without final newlines, or,
|
|
since we never expect to encounter such an input and therefore don't
|
|
care how it's classified, we can introduce one more catch-all rule, this
|
|
one which doesn't include a newline:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
asm\n |
|
|
auto\n |
|
|
break\n |
|
|
... etc ...
|
|
volatile\n |
|
|
while\n /* it's a keyword */
|
|
|
|
[a-z]+\n |
|
|
[a-z]+ |
|
|
.|\n /* it's not a keyword */
|
|
@end verbatim
|
|
@end example
|
|
|
|
Compiled with @samp{-Cf}, this is about as fast as one can get a
|
|
@code{flex} scanner to go for this particular problem.
|
|
|
|
A final note: @code{flex} is slow when matching @code{NUL}s,
|
|
particularly when a token contains multiple @code{NUL}s. It's best to
|
|
write rules which match @emph{short} amounts of text if it's anticipated
|
|
that the text will often include @code{NUL}s.
|
|
|
|
Another final note regarding performance: as mentioned in
|
|
@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
|
|
tokens is a slow process because it presently requires that the (huge)
|
|
token be rescanned from the beginning. Thus if performance is vital,
|
|
you should attempt to match ``large'' quantities of text but not
|
|
``huge'' quantities, where the cutoff between the two is at about 8K
|
|
characters per token.
|
|
|
|
@node Cxx, Reentrant, Performance, Top
|
|
@chapter Generating C++ Scanners
|
|
|
|
@cindex c++, experimental form of scanner class
|
|
@cindex experimental form of c++ scanner class
|
|
@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
|
|
and may change considerably between major releases.
|
|
|
|
@cindex C++
|
|
@cindex member functions, C++
|
|
@cindex methods, c++
|
|
@code{flex} provides two different ways to generate scanners for use
|
|
with C++. The first way is to simply compile a scanner generated by
|
|
@code{flex} using a C++ compiler instead of a C compiler. You should
|
|
not encounter any compilation errors (@pxref{Reporting Bugs}). You can
|
|
then use C++ code in your rule actions instead of C code. Note that the
|
|
default input source for your scanner remains @file{yyin}, and default
|
|
echoing is still done to @file{yyout}. Both of these remain @code{FILE
|
|
*} variables and not C++ @emph{streams}.
|
|
|
|
You can also use @code{flex} to generate a C++ scanner class, using the
|
|
@samp{-+} option (or, equivalently, @code{%option c++}), which is
|
|
automatically specified if the name of the @code{flex} executable ends
|
|
in a '+', such as @code{flex++}. When using this option, @code{flex}
|
|
defaults to generating the scanner to the file @file{lex.yy.cc} instead
|
|
of @file{lex.yy.c}. The generated scanner includes the header file
|
|
@file{FlexLexer.h}, which defines the interface to two C++ classes.
|
|
|
|
The first class,
|
|
@code{FlexLexer},
|
|
provides an abstract base class defining the general scanner class
|
|
interface. It provides the following member functions:
|
|
|
|
@table @code
|
|
@findex YYText (C++ only)
|
|
@item const char* YYText()
|
|
returns the text of the most recently matched token, the equivalent of
|
|
@code{yytext}.
|
|
|
|
@findex YYLeng (C++ only)
|
|
@item int YYLeng()
|
|
returns the length of the most recently matched token, the equivalent of
|
|
@code{yyleng}.
|
|
|
|
@findex lineno (C++ only)
|
|
@item int lineno() const
|
|
returns the current input line number (see @code{%option yylineno}), or
|
|
@code{1} if @code{%option yylineno} was not used.
|
|
|
|
@findex set_debug (C++ only)
|
|
@item void set_debug( int flag )
|
|
sets the debugging flag for the scanner, equivalent to assigning to
|
|
@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build
|
|
the scanner using @code{%option debug} to include debugging information
|
|
in it.
|
|
|
|
@findex debug (C++ only)
|
|
@item int debug() const
|
|
returns the current setting of the debugging flag.
|
|
@end table
|
|
|
|
Also provided are member functions equivalent to
|
|
@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
|
|
first argument is an @code{istream*} object pointer and not a
|
|
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
|
|
@code{yyrestart()} (again, the first argument is an @code{istream*}
|
|
object pointer).
|
|
|
|
@tindex yyFlexLexer (C++ only)
|
|
@tindex FlexLexer (C++ only)
|
|
The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
|
|
which is derived from @code{FlexLexer}. It defines the following
|
|
additional member functions:
|
|
|
|
@table @code
|
|
@findex yyFlexLexer constructor (C++ only)
|
|
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
|
|
constructs a @code{yyFlexLexer} object using the given streams for input
|
|
and output. If not specified, the streams default to @code{cin} and
|
|
@code{cout}, respectively.
|
|
|
|
@findex yylex (C++ version)
|
|
@item virtual int yylex()
|
|
performs the same role as @code{yylex()} does for ordinary @code{flex}
|
|
scanners: it scans the input stream, consuming tokens, until a rule's
|
|
action returns a value. If you derive a subclass @code{S} from
|
|
@code{yyFlexLexer} and want to access the member functions and variables
|
|
of @code{S} inside @code{yylex()}, then you need to use @code{%option
|
|
yyclass="S"} to inform @code{flex} that you will be using that subclass
|
|
instead of @code{yyFlexLexer}. In this case, rather than generating
|
|
@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()}
|
|
(and also generates a dummy @code{yyFlexLexer::yylex()} that calls
|
|
@code{yyFlexLexer::LexerError()} if called).
|
|
|
|
@findex switch_streams (C++ only)
|
|
@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
|
|
reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
|
|
@code{new_out} (if non-null), deleting the previous input buffer if
|
|
@code{yyin} is reassigned.
|
|
|
|
@item int yylex( istream* new_in, ostream* new_out = 0 )
|
|
first switches the input streams via @code{switch_streams( new_in,
|
|
new_out )} and then returns the value of @code{yylex()}.
|
|
@end table
|
|
|
|
In addition, @code{yyFlexLexer} defines the following protected virtual
|
|
functions which you can redefine in derived classes to tailor the
|
|
scanner:
|
|
|
|
@table @code
|
|
@findex LexerInput (C++ only)
|
|
@item virtual int LexerInput( char* buf, int max_size )
|
|
reads up to @code{max_size} characters into @code{buf} and returns the
|
|
number of characters read. To indicate end-of-input, return 0
|
|
characters. Note that @code{interactive} scanners (see the @samp{-B}
|
|
and @samp{-I} flags in @ref{Scanner Options}) define the macro
|
|
@code{YY_INTERACTIVE}. If you redefine @code{LexerInput()} and need to
|
|
take different actions depending on whether or not the scanner might be
|
|
scanning an interactive input source, you can test for the presence of
|
|
this name via @code{#ifdef} statements.
|
|
|
|
@findex LexerOutput (C++ only)
|
|
@item virtual void LexerOutput( const char* buf, int size )
|
|
writes out @code{size} characters from the buffer @code{buf}, which, while
|
|
@code{NUL}-terminated, may also contain internal @code{NUL}s if the
|
|
scanner's rules can match text with @code{NUL}s in them.
|
|
|
|
@cindex error reporting, in C++
|
|
@findex LexerError (C++ only)
|
|
@item virtual void LexerError( const char* msg )
|
|
reports a fatal error message. The default version of this function
|
|
writes the message to the stream @code{cerr} and exits.
|
|
@end table
|
|
|
|
Note that a @code{yyFlexLexer} object contains its @emph{entire}
|
|
scanning state. Thus you can use such objects to create reentrant
|
|
scanners, but see also @ref{Reentrant}. You can instantiate multiple
|
|
instances of the same @code{yyFlexLexer} class, and you can also combine
|
|
multiple C++ scanner classes together in the same program using the
|
|
@samp{-P} option discussed above.
|
|
|
|
Finally, note that the @code{%array} feature is not available to C++
|
|
scanner classes; you must use @code{%pointer} (the default).
|
|
|
|
Here is an example of a simple C++ scanner:
|
|
|
|
@cindex C++ scanners, use of
|
|
@example
|
|
@verbatim
|
|
    // An example of using the flex C++ scanner class.

%{
#include <iostream>
using namespace std;
int mylineno = 0;
%}

%option noyywrap

string  \"[^\n"]+\"

ws      [ \t]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*"    {
        int c;

        while((c = yyinput()) != 0)
            {
            if(c == '\n')
                ++mylineno;

            else if(c == '*')
                {
                if((c = yyinput()) == '/')
                    break;
                else
                    unput(c);
                }
            }
        }

{number}  cout << "number " << YYText() << '\n';

\n        mylineno++;

{name}    cout << "name " << YYText() << '\n';

{string}  cout << "string " << YYText() << '\n';

%%

int main( int /* argc */, char** /* argv */ )
{
    FlexLexer* lexer = new yyFlexLexer;
    while(lexer->yylex() != 0)
        ;
    delete lexer;
    return 0;
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@cindex C++, multiple different scanners
|
|
If you want to create multiple (different) lexer classes, you use the
|
|
@samp{-P} flag (or the @code{prefix=} option) to rename each
|
|
@code{yyFlexLexer} to some other @samp{xxFlexLexer}. You then can
|
|
include @file{<FlexLexer.h>} in your other sources once per lexer class,
|
|
first renaming @code{yyFlexLexer} as follows:
|
|
|
|
@cindex include files, with C++
|
|
@cindex header files, with C++
|
|
@cindex C++ scanners, including multiple scanners
|
|
@example
|
|
@verbatim
|
|
#undef yyFlexLexer
|
|
#define yyFlexLexer xxFlexLexer
|
|
#include <FlexLexer.h>
|
|
|
|
#undef yyFlexLexer
|
|
#define yyFlexLexer zzFlexLexer
|
|
#include <FlexLexer.h>
|
|
@end verbatim
|
|
@end example
|
|
|
|
if, for example, you used @code{%option prefix="xx"} for one of your
|
|
scanners and @code{%option prefix="zz"} for the other.
|
|
|
|
@node Reentrant, Lex and Posix, Cxx, Top
|
|
@chapter Reentrant C Scanners
|
|
|
|
@cindex reentrant, explanation
|
|
@code{flex} has the ability to generate a reentrant C scanner. This is
|
|
accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated
|
|
scanner is both portable, and safe to use in one or more separate threads of
|
|
control. The most common use for reentrant scanners is from within
|
|
multi-threaded applications. Any thread may create and execute a reentrant
|
|
@code{flex} scanner without the need for synchronization with other threads.
|
|
|
|
@menu
|
|
* Reentrant Uses::
|
|
* Reentrant Overview::
|
|
* Reentrant Example::
|
|
* Reentrant Detail::
|
|
* Reentrant Functions::
|
|
@end menu
|
|
|
|
@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
|
|
@section Uses for Reentrant Scanners
|
|
|
|
Reentrant scanners are also useful outside multi-threaded applications. For example, you
|
|
could scan two or more files simultaneously to implement a @code{diff} at
|
|
the token level (i.e., instead of at the character level):
|
|
|
|
@cindex reentrant scanners, multiple interleaved scanners
|
|
@example
|
|
@verbatim
|
|
/* Example of maintaining more than one active scanner. */

int tok1, tok2;

do {
    tok1 = yylex( scanner_1 );
    tok2 = yylex( scanner_2 );

    if( tok1 != tok2 )
        printf("Files are different.\n");

} while ( tok1 && tok2 );
|
|
@end verbatim
|
|
@end example
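
The loop above works only because each scanner keeps its state in its
own object. The same token-level comparison can be sketched in plain C
without @code{flex}. In this sketch the struct @code{mini_scanner} and
the function @code{next_token} are hypothetical stand-ins for
@code{yyscan_t} and @code{yylex}, and a token is simply a
whitespace-separated word:

@example
@verbatim
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-scanner state; plays the role of yyscan_t. */
struct mini_scanner {
    const char *pos;    /* next unread character */
    const char *text;   /* start of the most recent token */
    size_t      leng;   /* length of the most recent token */
};

/* Plays the role of yylex(): returns 1 for a word, 0 at end of input. */
static int next_token(struct mini_scanner *s)
{
    while (isspace((unsigned char)*s->pos))
        ++s->pos;
    s->text = s->pos;
    while (*s->pos && !isspace((unsigned char)*s->pos))
        ++s->pos;
    s->leng = (size_t)(s->pos - s->text);
    return s->leng != 0;
}

/* Returns 1 if both inputs contain the same sequence of words. */
static int same_tokens(const char *a, const char *b)
{
    struct mini_scanner s1 = { a, a, 0 };
    struct mini_scanner s2 = { b, b, 0 };

    for (;;) {
        int tok1 = next_token(&s1);
        int tok2 = next_token(&s2);

        if (tok1 != tok2)
            return 0;           /* one input ended before the other */
        if (!tok1)
            return 1;           /* both inputs ended together */
        if (s1.leng != s2.leng ||
            memcmp(s1.text, s2.text, s1.leng) != 0)
            return 0;           /* the words differ */
    }
}
@end verbatim
@end example

Because the two @code{mini_scanner} objects share nothing, the calls to
@code{next_token} can be interleaved freely, which is the property a
reentrant @code{flex} scanner provides.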
|
|
|
|
Another use for a reentrant scanner is recursion.
|
|
(Note that a recursive scanner can also be created using a non-reentrant scanner and
|
|
buffer states. @xref{Multiple Input Buffers}.)
|
|
|
|
The following crude scanner supports the @samp{eval} command by invoking
|
|
another instance of itself.
|
|
|
|
@cindex reentrant scanners, recursive invocation
|
|
@example
|
|
@verbatim
|
|
/* Example of recursive invocation. */
|
|
|
|
%option reentrant
|
|
|
|
%%
|
|
"eval(".+")" {
|
|
yyscan_t scanner;
|
|
YY_BUFFER_STATE buf;
|
|
|
|
yylex_init( &scanner );
|
|
yytext[yyleng-1] = ' ';
|
|
|
|
buf = yy_scan_string( yytext + 5, scanner );
|
|
yylex( scanner );
|
|
|
|
yy_delete_buffer(buf,scanner);
|
|
yylex_destroy( scanner );
|
|
}
|
|
...
|
|
%%
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
|
|
@section An Overview of the Reentrant API
|
|
|
|
@cindex reentrant, API explanation
|
|
The API for reentrant scanners differs from that for non-reentrant
|
|
scanners. Here is a quick overview of the API:
|
|
|
|
@itemize
|
|
@item
@code{%option reentrant} must be specified.
|
|
|
|
@item
|
|
All functions take one additional argument: @code{yyscanner}.
|
|
|
|
@item
|
|
All global variables are replaced by their macro equivalents.
|
|
(We tell you this because it may be important to you during debugging.)
|
|
|
|
@item
|
|
@code{yylex_init} and @code{yylex_destroy} must be called before and
|
|
after @code{yylex}, respectively.
|
|
|
|
@item
|
|
Accessor methods (get/set functions) provide access to common
|
|
@code{flex} variables.
|
|
|
|
@item
|
|
User-specific data can be stored in @code{yyextra}.
|
|
@end itemize
|
|
|
|
@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
|
|
@section Reentrant Example
|
|
|
|
First, an example of a reentrant scanner:
|
|
@cindex reentrant, example of
|
|
@example
|
|
@verbatim
|
|
/* This scanner prints "//" comments. */
|
|
|
|
%option reentrant stack noyywrap
|
|
%x COMMENT
|
|
|
|
%%
|
|
|
|
"//" yy_push_state( COMMENT, yyscanner);
|
|
.|\n
|
|
|
|
<COMMENT>\n yy_pop_state( yyscanner );
|
|
<COMMENT>[^\n]+ fprintf( yyout, "%s\n", yytext);
|
|
|
|
%%
|
|
|
|
int main ( int argc, char * argv[] )
{
    yyscan_t scanner;

    yylex_init ( &scanner );
    yylex ( scanner );
    yylex_destroy ( scanner );
    return 0;
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
|
|
@section The Reentrant API in Detail
|
|
|
|
Here are the things you need to do or know to use the reentrant C API of
|
|
@code{flex}.
|
|
|
|
@menu
|
|
* Specify Reentrant::
|
|
* Extra Reentrant Argument::
|
|
* Global Replacement::
|
|
* Init and Destroy Functions::
|
|
* Accessor Methods::
|
|
* Extra Data::
|
|
* About yyscan_t::
|
|
@end menu
|
|
|
|
@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
|
|
@subsection Declaring a Scanner As Reentrant
|
|
|
|
@code{%option reentrant} (@samp{--reentrant}) must be specified.
|
|
|
|
Notice that @code{%option reentrant} is specified in the above example
|
|
(@pxref{Reentrant Example}). Had this option not been specified,
|
|
@code{flex} would have happily generated a non-reentrant scanner without
|
|
complaining. You may explicitly specify @code{%option noreentrant}, if
|
|
you do @emph{not} want a reentrant scanner, although it is not
|
|
necessary. The default is to generate a non-reentrant scanner.
|
|
|
|
@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
|
|
@subsection The Extra Argument
|
|
|
|
@cindex reentrant, calling functions
|
|
@vindex yyscanner (reentrant only)
|
|
All functions take one additional argument: @code{yyscanner}.
|
|
|
|
Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
|
|
both have an argument, @code{yyscanner}, that is not present in a
|
|
non-reentrant scanner. Here are the declarations of
|
|
@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:
|
|
|
|
@example
|
|
@verbatim
|
|
static void yy_push_state ( int new_state , yyscan_t yyscanner ) ;
|
|
static void yy_pop_state ( yyscan_t yyscanner ) ;
|
|
@end verbatim
|
|
@end example
|
|
|
|
Notice that the argument @code{yyscanner} appears in the declaration of
|
|
both functions. In fact, all @code{flex} functions in a reentrant
|
|
scanner have this additional argument. It is always the last argument
|
|
in the argument list, it is always of type @code{yyscan_t} (which is
|
|
typedef'd to @code{void *}) and it is
|
|
always named @code{yyscanner}. As you may have guessed,
|
|
@code{yyscanner} is a pointer to an opaque data structure encapsulating
|
|
the current state of the scanner. For a list of function declarations,
|
|
see @ref{Reentrant Functions}. Note that preprocessor macros, such as
|
|
@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
|
|
additional argument.
|
|
|
|
@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
|
|
@subsection Global Variables Replaced By Macros
|
|
|
|
@cindex reentrant, accessing flex variables
|
|
All global variables in traditional flex have been replaced by macro equivalents.
|
|
|
|
Note that in the above example, @code{yyout} and @code{yytext} are
|
|
not plain variables. These are macros that will expand to their equivalent lvalue.
|
|
All of the familiar @code{flex} globals have been replaced by their macro
|
|
equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
|
|
@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
|
|
are macros. You may safely use these macros in actions as if they were plain
|
|
variables. We only tell you this so you don't expect to link to these variables
|
|
externally. Currently, each macro expands to a member of an internal struct, e.g.,
|
|
|
|
@example
|
|
@verbatim
|
|
#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
|
|
@end verbatim
|
|
@end example
|
|
|
|
One important thing to remember about
|
|
@code{yytext}
|
|
and friends is that
|
|
@code{yytext}
|
|
is not a global variable in a reentrant
|
|
scanner, so you cannot access it directly from outside an action or from
|
|
other functions. You must use an accessor method, e.g.,
|
|
@code{yyget_text},
|
|
to accomplish this. (See below).
|
|
|
|
@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
|
|
@subsection Init and Destroy Functions
|
|
|
|
@cindex memory, considerations for reentrant scanners
|
|
@cindex reentrant, initialization
|
|
@findex yylex_init
|
|
@findex yylex_destroy
|
|
|
|
@code{yylex_init} and @code{yylex_destroy} must be called before and
|
|
after @code{yylex}, respectively.
|
|
|
|
@example
|
|
@verbatim
|
|
int yylex_init ( yyscan_t * ptr_yy_globals ) ;
|
|
int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
|
|
int yylex ( yyscan_t yyscanner ) ;
|
|
int yylex_destroy ( yyscan_t yyscanner ) ;
|
|
@end verbatim
|
|
@end example
|
|
|
|
The function @code{yylex_init} must be called before calling any other
|
|
function. The argument to @code{yylex_init} is the address of an
|
|
uninitialized pointer to be filled in by @code{yylex_init}, overwriting
|
|
any previous contents. The function @code{yylex_init_extra} may be used
|
|
instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
|
|
See the section on yyextra, below, for more details.
|
|
|
|
The value stored in @code{ptr_yy_globals} should
|
|
thereafter be passed to @code{yylex} and @code{yylex_destroy}. Flex
|
|
does not save the argument passed to @code{yylex_init}, so it is safe to
|
|
pass the address of a local pointer to @code{yylex_init} so long as it remains
|
|
in scope for the duration of all calls to the scanner, up to and including
|
|
the call to @code{yylex_destroy}.
|
|
|
|
The function
|
|
@code{yylex} should be familiar to you by now. The reentrant version
|
|
takes one argument, which is the value returned (via an argument) by
|
|
@code{yylex_init}. Otherwise, it behaves the same as the non-reentrant
|
|
version of @code{yylex}.
|
|
|
|
Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
|
|
or non-zero on failure, in which case @code{errno} is set to one of the following values:
|
|
|
|
@itemize
|
|
@item ENOMEM
|
|
Memory allocation error. @xref{memory-management}.
|
|
@item EINVAL
|
|
Invalid argument.
|
|
@end itemize
|
|
|
|
|
|
The function @code{yylex_destroy} should be
|
|
called to free resources used by the scanner. After @code{yylex_destroy}
|
|
is called, the contents of @code{yyscanner} should not be used. Of
|
|
course, there is no need to destroy a scanner if you plan to reuse it.
|
|
A @code{flex} scanner (both reentrant and non-reentrant) may be
|
|
restarted by calling @code{yyrestart}.
|
|
|
|
Below is an example of a program that creates a scanner, uses it, then destroys
|
|
it when done:
|
|
|
|
@example
|
|
@verbatim
|
|
int main ()
{
    yyscan_t scanner;
    int tok;

    yylex_init(&scanner);

    while ((tok=yylex(scanner)) > 0)
        printf("tok=%d yytext=%s\n", tok, yyget_text(scanner));

    yylex_destroy(scanner);
    return 0;
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
|
|
@subsection Accessing Variables with Reentrant Scanners
|
|
|
|
@cindex reentrant, accessor functions
|
|
Accessor methods (get/set functions) provide access to common
|
|
@code{flex} variables.
|
|
|
|
Many scanners that you build will be part of a larger project. Portions
|
|
of your project will need access to @code{flex} values, such as
|
|
@code{yytext}. In a non-reentrant scanner, these values are global, so
|
|
there is no problem accessing them. However, in a reentrant scanner, there are no
|
|
global @code{flex} values. You cannot access them directly. Instead,
|
|
you must access @code{flex} values using accessor methods (get/set
|
|
functions). Each accessor method is named @code{yyget_NAME} or
|
|
@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
|
|
variable you want. For example:
|
|
|
|
@cindex accessor functions, use of
|
|
@example
|
|
@verbatim
|
|
/* Set the last character of yytext to NUL. */
void chop ( yyscan_t scanner )
{
    int len = yyget_leng( scanner );
    yyget_text( scanner )[len - 1] = '\0';
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
The above code may be called from within an action like this:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
.+\n { chop( yyscanner );}
|
|
@end verbatim
|
|
@end example
|
|
|
|
You may find that @code{%option header-file} is particularly useful for generating
|
|
prototypes of all the accessor functions. @xref{option-header}.
|
|
|
|
@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
|
|
@subsection Extra Data
|
|
|
|
@cindex reentrant, extra data
|
|
@vindex yyextra
|
|
User-specific data can be stored in @code{yyextra}.
|
|
|
|
In a reentrant scanner, it is unwise to use global variables to
|
|
communicate with or maintain state between different pieces of your program.
|
|
However, you may need access to external data or invoke external functions
|
|
from within the scanner actions.
|
|
Likewise, you may need to pass information to your scanner
|
|
(e.g., open file descriptors, or database connections).
|
|
In a non-reentrant scanner, the only way to do this would be through the
|
|
use of global variables.
|
|
@code{Flex} allows you to store arbitrary, ``extra'' data in a scanner.
|
|
This data is accessible through the accessor methods
|
|
@code{yyget_extra} and @code{yyset_extra}
|
|
from outside the scanner, and through the shortcut macro
|
|
@code{yyextra}
|
|
from within the scanner itself. They are defined as follows:
|
|
|
|
@tindex YY_EXTRA_TYPE (reentrant only)
|
|
@findex yyget_extra
|
|
@findex yyset_extra
|
|
@example
|
|
@verbatim
|
|
#define YY_EXTRA_TYPE void*
|
|
YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
|
|
void yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
|
|
@end verbatim
|
|
@end example
|
|
|
|
In addition, an extra form of @code{yylex_init} is provided,
|
|
@code{yylex_init_extra}. This function is provided so that the yyextra value can
|
|
be accessed from within the very first yyalloc, used to allocate
|
|
the scanner itself.
|
|
|
|
By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You
|
|
may redefine this type using @code{%option extra-type="your_type"} in
|
|
the scanner:
|
|
|
|
@cindex YY_EXTRA_TYPE, defining your own type
|
|
@example
|
|
@verbatim
|
|
/* An example of overriding YY_EXTRA_TYPE. */
%{
#include <sys/stat.h>
#include <unistd.h>
%}
%option reentrant
%option extra-type="struct stat *"
%%

__filesize__     printf( "%ld", (long) yyextra->st_size );
__lastmod__      printf( "%ld", (long) yyextra->st_mtime );
%%
void scan_file( char* filename )
{
    yyscan_t scanner;
    struct stat buf;
    FILE *in;

    in = fopen( filename, "r" );
    if( in == NULL )
        return;
    stat( filename, &buf );

    /* yyextra has type "struct stat *", so pass the address of buf. */
    yylex_init_extra( &buf, &scanner );
    yyset_in( in, scanner );
    yylex( scanner );
    yylex_destroy( scanner );

    fclose( in );
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
|
|
@node About yyscan_t, , Extra Data, Reentrant Detail
|
|
@subsection About yyscan_t
|
|
|
|
@tindex yyscan_t (reentrant only)
|
|
@code{yyscan_t} is defined as:
|
|
|
|
@example
|
|
@verbatim
|
|
typedef void* yyscan_t;
|
|
@end verbatim
|
|
@end example
|
|
|
|
It is initialized by @code{yylex_init()} to point to
|
|
an internal structure. You should never access this value
|
|
directly. In particular, you should never attempt to free it
|
|
(use @code{yylex_destroy()} instead.)
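
The opaque-handle pattern behind @code{yyscan_t} can be sketched in
plain C. Everything prefixed with @code{demo_} below is hypothetical and
is not part of the @code{flex} API, and the real implementation may
differ. The sketch only mirrors the contract described in this chapter:
the init function fills in a caller-supplied pointer, returns non-zero
and sets @code{errno} on failure, callers reach the state only through
accessor functions, and the matching destroy function is the only
correct way to release the handle.

@example
@verbatim
#include <errno.h>
#include <stdlib.h>

typedef void *demo_scan_t;      /* opaque to callers, like yyscan_t */

struct demo_guts {              /* private state, hidden from callers */
    int   lineno;
    void *extra;                /* user data, in the spirit of yyextra */
};

/* Fills in *out; returns 0 on success, non-zero with errno set. */
static int demo_init_extra(void *user_data, demo_scan_t *out)
{
    struct demo_guts *g;

    if (out == NULL) {
        errno = EINVAL;
        return 1;
    }
    g = malloc(sizeof *g);
    if (g == NULL) {
        *out = NULL;
        errno = ENOMEM;
        return 1;
    }
    g->lineno = 1;
    g->extra = user_data;
    *out = g;
    return 0;
}

/* Accessors are the only way for callers to reach the private state. */
static void *demo_get_extra(demo_scan_t scanner)
{
    return ((struct demo_guts *)scanner)->extra;
}

static int demo_get_lineno(demo_scan_t scanner)
{
    return ((struct demo_guts *)scanner)->lineno;
}

/* Releases the handle; it must not be used afterwards. */
static int demo_destroy(demo_scan_t scanner)
{
    free(scanner);
    return 0;
}
@end verbatim
@end example

Since callers see only a @code{void *}, the layout of the private struct
can change without affecting them, which is one reason never to access
or free the handle directly.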
|
|
|
|
@node Reentrant Functions, , Reentrant Detail, Reentrant
|
|
@section Functions and Macros Available in Reentrant C Scanners
|
|
|
|
The following functions are available in a reentrant scanner:
|
|
|
|
@findex yyget_text
|
|
@findex yyget_leng
|
|
@findex yyget_in
|
|
@findex yyget_out
|
|
@findex yyget_lineno
|
|
@findex yyset_in
|
|
@findex yyset_out
|
|
@findex yyset_lineno
|
|
@findex yyget_debug
|
|
@findex yyset_debug
|
|
@findex yyget_extra
|
|
@findex yyset_extra
|
|
|
|
@example
|
|
@verbatim
|
|
char *yyget_text ( yyscan_t scanner );
|
|
int yyget_leng ( yyscan_t scanner );
|
|
FILE *yyget_in ( yyscan_t scanner );
|
|
FILE *yyget_out ( yyscan_t scanner );
|
|
int yyget_lineno ( yyscan_t scanner );
|
|
YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
|
|
int yyget_debug ( yyscan_t scanner );
|
|
|
|
void yyset_debug ( int flag, yyscan_t scanner );
|
|
void yyset_in ( FILE * in_str , yyscan_t scanner );
|
|
void yyset_out ( FILE * out_str , yyscan_t scanner );
|
|
void yyset_lineno ( int line_number , yyscan_t scanner );
|
|
void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
|
|
@end verbatim
|
|
@end example
|
|
|
|
There are no ``set'' functions for yytext and yyleng. This is intentional.
|
|
|
|
The following macro shortcuts are available in actions in a reentrant
|
|
scanner:
|
|
|
|
@example
|
|
@verbatim
|
|
yytext
|
|
yyleng
|
|
yyin
|
|
yyout
|
|
yylineno
|
|
yyextra
|
|
yy_flex_debug
|
|
@end verbatim
|
|
@end example
|
|
|
|
@cindex yylineno, in a reentrant scanner
|
|
In a reentrant C scanner, support for yylineno is always present
|
|
(i.e., you may access yylineno), but the value is never modified by
|
|
@code{flex} unless @code{%option yylineno} is enabled. This is to allow
|
|
the user to maintain the line count independently of @code{flex}.
|
|
|
|
@anchor{bison-functions}
|
|
The following functions and macros are made available when @code{%option
|
|
bison-bridge} (@samp{--bison-bridge}) is specified:
|
|
|
|
@example
|
|
@verbatim
|
|
YYSTYPE * yyget_lval ( yyscan_t scanner );
|
|
void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
|
|
yylval
|
|
@end verbatim
|
|
@end example
|
|
|
|
The following functions and macros are made available
|
|
when @code{%option bison-locations} (@samp{--bison-locations}) is specified:
|
|
|
|
@example
|
|
@verbatim
|
|
YYLTYPE *yyget_lloc ( yyscan_t scanner );
|
|
void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
|
|
yylloc
|
|
@end verbatim
|
|
@end example
|
|
|
|
Support for yylval assumes that @code{YYSTYPE} is a valid type. Support for
|
|
yylloc assumes that @code{YYLTYPE} is a valid type. Typically, these types are
|
|
generated by @code{bison}, and are included in section 1 of the @code{flex}
|
|
input.
|
|
|
|
@node Lex and Posix, Memory Management, Reentrant, Top
|
|
@chapter Incompatibilities with Lex and Posix
|
|
|
|
@cindex POSIX and lex
|
|
@cindex lex (traditional) and POSIX
|
|
|
|
@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
|
|
implementations do not share any code, though), with some extensions and
|
|
incompatibilities, both of which are of concern to those who wish to
|
|
write scanners acceptable to both implementations. @code{flex} is fully
|
|
compliant with the POSIX @code{lex} specification, except that when
|
|
using @code{%pointer} (the default), a call to @code{unput()} destroys
|
|
the contents of @code{yytext}, which is counter to the POSIX
|
|
specification. In this section we discuss all of the known areas of
|
|
incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
|
|
specification. @code{flex}'s @samp{-l} option turns on maximum
|
|
compatibility with the original AT&T @code{lex} implementation, at the
|
|
cost of a major loss in the generated scanner's performance. We note
|
|
below which incompatibilities can be overcome using the @samp{-l}
|
|
option. @code{flex} is fully compatible with @code{lex} with the
|
|
following exceptions:
|
|
|
|
@itemize
|
|
@item
|
|
The undocumented @code{lex} scanner internal variable @code{yylineno} is
|
|
not supported unless @samp{-l} or @code{%option yylineno} is used.
|
|
|
|
@item
|
|
@code{yylineno} should be maintained on a per-buffer basis, rather than
|
|
a per-scanner (single global variable) basis.
|
|
|
|
@item
|
|
@code{yylineno} is not part of the POSIX specification.
|
|
|
|
@item
|
|
The @code{input()} routine is not redefinable, though it may be called
|
|
to read characters following whatever has been matched by a rule. If
|
|
@code{input()} encounters an end-of-file the normal @code{yywrap()}
|
|
processing is done. A ``real'' end-of-file is returned by
|
|
@code{input()} as @code{EOF}.
|
|
|
|
@item
|
|
Input is instead controlled by defining the @code{YY_INPUT()} macro.
|
|
|
|
@item
|
|
The @code{flex} restriction that @code{input()} cannot be redefined is
|
|
in accordance with the POSIX specification, which simply does not
|
|
specify any way of controlling the scanner's input other than by making
|
|
an initial assignment to @file{yyin}.
|
|
|
|
@item
|
|
The @code{unput()} routine is not redefinable. This restriction is in
|
|
accordance with POSIX.
|
|
|
|
@item
|
|
@code{flex} scanners are not as reentrant as @code{lex} scanners. In
|
|
particular, if you have an interactive scanner and an interrupt handler
|
|
which long-jumps out of the scanner, and the scanner is subsequently
|
|
called again, you may get the following message:
|
|
|
|
@cindex error messages, end of buffer missed
|
|
@example
|
|
@verbatim
|
|
fatal flex scanner internal error--end of buffer missed
|
|
@end verbatim
|
|
@end example
|
|
|
|
To reenter the scanner, first use:
|
|
|
|
@cindex restarting the scanner
|
|
@example
|
|
@verbatim
|
|
yyrestart( yyin );
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note that this call will throw away any buffered input; usually this
|
|
isn't a problem with an interactive scanner. @xref{Reentrant}, for
|
|
@code{flex}'s reentrant API.
|
|
|
|
@item
|
|
Also note that @code{flex} C++ scanner classes
|
|
@emph{are}
|
|
reentrant, so if using C++ is an option for you, you should use
|
|
them instead. @xref{Cxx}, and @ref{Reentrant} for details.
|
|
|
|
@item
|
|
@code{output()} is not supported. Output from the @b{ECHO} macro is
|
|
done to the file-pointer @code{yyout} (default @file{stdout}).
|
|
|
|
@item
|
|
@code{output()} is not part of the POSIX specification.
|
|
|
|
@item
|
|
@code{lex} does not support exclusive start conditions (%x), though they
|
|
are in the POSIX specification.
|
|
|
|
@item
|
|
When definitions are expanded, @code{flex} encloses them in parentheses.
|
|
With @code{lex}, the following:
|
|
|
|
@cindex name definitions, not POSIX
|
|
@example
|
|
@verbatim
|
|
NAME [A-Z][A-Z0-9]*
|
|
%%
|
|
foo{NAME}? printf( "Found it\n" );
|
|
%%
|
|
@end verbatim
|
|
@end example
|
|
|
|
will not match the string @samp{foo} because when the macro is expanded
|
|
the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence
|
|
is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With
|
|
@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?}
|
|
and so the string @samp{foo} will match.
|
|
|
|
@item
|
|
Note that if the definition begins with @samp{^} or ends with @samp{$}
|
|
then it is @emph{not} expanded with parentheses, to allow these
|
|
operators to appear in definitions without losing their special
|
|
meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
|
|
cannot be used in a @code{flex} definition.
|
|
|
|
@item
|
|
Using @samp{-l} results in the @code{lex} behavior of no parentheses
|
|
around the definition.
|
|
|
|
@item
|
|
The POSIX specification is that the definition be enclosed in parentheses.
|
|
|
|
@item
|
|
Some implementations of @code{lex} allow a rule's action to begin on a
|
|
separate line, if the rule's pattern has trailing whitespace:
|
|
|
|
@cindex patterns and actions on different lines
|
|
@example
|
|
@verbatim
|
|
%%
|
|
foo|bar<space here>
|
|
{ foobar_action();}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@code{flex} does not support this feature.
|
|
|
|
@item
|
|
The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
|
|
supported. It is not part of the POSIX specification.
|
|
|
|
@item
|
|
After a call to @code{unput()}, @emph{yytext} is undefined until the
|
|
next token is matched, unless the scanner was built using @code{%array}.
|
|
This is not the case with @code{lex} or the POSIX specification. The
|
|
@samp{-l} option does away with this incompatibility.
|
|
|
|
@item
|
|
The precedence of the @samp{@{,@}} (numeric range) operator is
|
|
different. The AT&T and POSIX specifications of @code{lex}
|
|
interpret @samp{abc@{1,3@}} as ``match one, two,
|
|
or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
|
|
as ``match @samp{ab} followed by one, two, or three occurrences of
|
|
@samp{c}''. The @samp{-l} and @samp{--posix} options do away with this
|
|
incompatibility.
|
|
|
|
@item
|
|
The precedence of the @samp{^} operator is different. @code{lex}
|
|
interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a
line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match
|
|
either @samp{foo} or @samp{bar} if they come at the beginning of a
|
|
line''. The latter is in agreement with the POSIX specification.
|
|
|
|
@item
|
|
The special table-size declarations such as @code{%a} supported by
|
|
@code{lex} are not required by @code{flex} scanners. @code{flex}
|
|
ignores them.
|
|
@item
|
|
The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be
|
|
written for use with either @code{flex} or @code{lex}. Scanners also
|
|
include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION}
|
|
and @code{YY_FLEX_SUBMINOR_VERSION}
|
|
indicating which version of @code{flex} generated the scanner. For
|
|
example, for the 2.5.22 release, these defines would be 2, 5 and 22
|
|
respectively. If the version of @code{flex} being used is a beta
|
|
version, then the symbol @code{FLEX_BETA} is defined.
|
|
|
|
@item
|
|
The symbols @samp{[[} and @samp{]]} in the code sections of the input
|
|
may conflict with the m4 delimiters. @xref{M4 Dependency}.
|
|
|
|
|
|
@end itemize
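
The version macros described in the list above can be tested with ordinary
preprocessor conditionals. The following is a minimal sketch, not part of
@code{flex} itself: the helper names @code{scanner_generator} and
@code{scanner_version} are hypothetical, and the code is assumed to be placed
in the user code section of a scanner so that the macros are visible.

@example
@verbatim
#include <stdio.h>

/* Names the tool that generated the scanner.  flex defines FLEX_SCANNER
 * (and FLEX_BETA for beta versions); traditional lex defines neither. */
static const char *scanner_generator(void)
{
#if defined(FLEX_SCANNER) && defined(FLEX_BETA)
    return "flex (beta)";
#elif defined(FLEX_SCANNER)
    return "flex";
#else
    return "lex";
#endif
}

/* Writes a string such as "2.5.22" into buf when the scanner was built
 * by flex, or an empty string otherwise. */
static void scanner_version(char *buf, size_t size)
{
#if defined(FLEX_SCANNER) && defined(YY_FLEX_SUBMINOR_VERSION)
    snprintf(buf, size, "%d.%d.%d", YY_FLEX_MAJOR_VERSION,
             YY_FLEX_MINOR_VERSION, YY_FLEX_SUBMINOR_VERSION);
#else
    if (size > 0)
        buf[0] = '\0';
#endif
}
@end verbatim
@end example

When the file is compiled without @code{flex}, so that @code{FLEX_SCANNER} is
undefined, @code{scanner_generator} reports @samp{lex} and
@code{scanner_version} produces an empty string.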
|
|
|
|
@cindex POSIX compliance
|
|
@cindex non-POSIX features of flex
|
|
The following @code{flex} features are not included in @code{lex} or the
|
|
POSIX specification:
|
|
|
|
@itemize
|
|
@item
|
|
C++ scanners
|
|
@item
|
|
%option
|
|
@item
|
|
start condition scopes
|
|
@item
|
|
start condition stacks
|
|
@item
|
|
interactive/non-interactive scanners
|
|
@item
|
|
yy_scan_string() and friends
|
|
@item
|
|
yyterminate()
|
|
@item
|
|
yy_set_interactive()
|
|
@item
|
|
yy_set_bol()
|
|
@item
|
|
YY_AT_BOL()
|
|
@item
<<EOF>>
|
|
@item
|
|
<*>
|
|
@item
|
|
YY_DECL
|
|
@item
|
|
YY_START
|
|
@item
|
|
YY_USER_ACTION
|
|
@item
|
|
YY_USER_INIT
|
|
@item
|
|
#line directives
|
|
@item
|
|
%@{@}'s around actions
|
|
@item
|
|
reentrant C API
|
|
@item
|
|
multiple actions on a line
|
|
@item
|
|
almost all of the @code{flex} command-line options
|
|
@end itemize
|
|
|
|
The feature ``multiple actions on a line''
|
|
refers to the fact that with @code{flex} you can put multiple actions on
|
|
the same line, separated with semi-colons, while with @code{lex}, the
|
|
following:
|
|
|
|
@example
|
|
@verbatim
|
|
foo handle_foo(); ++num_foos_seen;
|
|
@end verbatim
|
|
@end example
|
|
|
|
is (rather surprisingly) truncated to
|
|
|
|
@example
|
|
@verbatim
|
|
foo handle_foo();
|
|
@end verbatim
|
|
@end example
|
|
|
|
@code{flex} does not truncate the action. Actions that are not enclosed
|
|
in braces are simply terminated at the end of the line.
|
|
|
|
@node Memory Management, Serialized Tables, Lex and Posix, Top
|
|
@chapter Memory Management
|
|
|
|
@cindex memory management
|
|
@anchor{memory-management}
|
|
This chapter describes how flex handles dynamic memory, and how you can
|
|
override the default behavior.
|
|
|
|
@menu
|
|
* The Default Memory Management::
|
|
* Overriding The Default Memory Management::
|
|
* A Note About yytext And Memory::
|
|
@end menu
|
|
|
|
@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
|
|
@section The Default Memory Management
|
|
|
|
Flex allocates dynamic memory during initialization, and once in a while from
|
|
within a call to yylex(). Initialization takes place during the first call to
|
|
yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a
|
|
buffer. As of version 2.5.9, Flex will clean up all memory when you call
@code{yylex_destroy}. @xref{faq-memory-leak}.
|
|
|
|
Flex allocates dynamic memory for five purposes, listed below @footnote{The
|
|
quantities given here are approximate, and may vary due to host architecture,
|
|
compiler configuration, or due to future enhancements to flex.}
|
|
|
|
@table @asis
|
|
|
|
@item 16kB for the input buffer.
|
|
Flex allocates memory for the character buffer used to perform pattern
|
|
matching. Flex must read ahead from the input stream and store it in a large
|
|
character buffer. This buffer is typically the largest chunk of dynamic memory
|
|
flex consumes. This buffer will grow if necessary, doubling the size each time.
|
|
Flex frees this memory when you call yylex_destroy(). The default size of this
|
|
buffer (16384 bytes) is almost always too large. The ideal size for this
|
|
buffer is the length of the longest token expected, in bytes, plus a little more. Flex will allocate a few
|
|
extra bytes for housekeeping. Currently, to override the size of the input buffer
|
|
you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
|
|
to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
|
|
API.
|
|
|
|
@item 64kB for the REJECT state. This will only be allocated if you use REJECT.
|
|
The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the
|
|
input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well.
|
|
|
|
@item 100 bytes for the start condition stack.
|
|
Flex allocates memory for the start condition stack. This is the stack used
|
|
for pushing start states, i.e., with yy_push_state(). It will grow if
|
|
necessary. Since the states are simply integers, this stack doesn't consume
|
|
much memory. This stack is not present if @code{%option stack} is not
|
|
specified. You will rarely need to tune this buffer. The ideal size for this
|
|
stack is the maximum depth expected. The memory for this stack is
|
|
automatically destroyed when you call yylex_destroy(). @xref{option-stack}.
|
|
|
|
@item 40 bytes for each YY_BUFFER_STATE.
|
|
Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself
|
|
is about 40 bytes, plus an additional large character buffer (described above.)
|
|
The initial buffer state is created during initialization, and with each call
|
|
to yy_create_buffer(). You can't tune the size of this, but you can tune the
|
|
character buffer as described above. Any buffer state that you explicitly
|
|
create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You
|
|
must call yy_delete_buffer() to free the memory. The exception to this rule is
|
|
that flex will delete the current buffer automatically when you call
|
|
yylex_destroy(). If you delete the current buffer, be sure to set it to NULL.
|
|
That way, flex will not try to delete the buffer a second time (possibly
|
|
crashing your program!) At the time of this writing, flex does not provide a
|
|
growable stack for the buffer states. You have to manage that yourself.
|
|
@xref{Multiple Input Buffers}.
|
|
|
|
@item 84 bytes for the reentrant scanner guts
|
|
Flex allocates about 84 bytes for the reentrant scanner structure when
|
|
you call yylex_init(). It is destroyed when the user calls yylex_destroy().
|
|
|
|
@end table
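
The approximate figures in the list above can be combined into a rough
estimate of a scanner's initial heap use. The sketch below is only
illustrative arithmetic. The function and structure names are hypothetical,
the byte counts are the approximate values quoted above, and the REJECT
state buffer is assumed to hold one @code{int}-sized state per buffered
character. That assumption matches the quoted 64kB for a 16kB buffer when
@code{int} is four bytes.

@example
@verbatim
#include <stddef.h>

/* Approximate figures from the list above; they vary by platform. */
enum {
    BUFFER_STATE_BYTES = 40,
    START_STACK_BYTES  = 100,
    REENTRANT_BYTES    = 84
};

struct scanner_features {
    int uses_reject;    /* the scanner uses REJECT */
    int uses_stack;     /* %option stack */
    int reentrant;      /* %option reentrant */
};

/* Rough initial heap use for one scanner with a single input buffer of
 * buf_size bytes (the value of YY_BUF_SIZE). */
size_t estimate_scanner_bytes(size_t buf_size, struct scanner_features f)
{
    size_t total = buf_size + BUFFER_STATE_BYTES;

    if (f.uses_reject)
        total += buf_size * sizeof(int);  /* one state per character */
    if (f.uses_stack)
        total += START_STACK_BYTES;
    if (f.reentrant)
        total += REENTRANT_BYTES;
    return total;
}
@end verbatim
@end example

With the default 16384-byte buffer and no optional features, the estimate is
16424 bytes. Enabling REJECT adds a state buffer several times the size of
the input buffer, which is why lowering @code{YY_BUF_SIZE} has the largest
effect.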
|
|
|
|
|
|
@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management
|
|
@section Overriding The Default Memory Management
|
|
|
|
@cindex yyalloc, overriding
|
|
@cindex yyrealloc, overriding
|
|
@cindex yyfree, overriding
|
|
|
|
Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree}
|
|
when it needs to allocate or free memory. By default, these functions are
|
|
wrappers around the standard C functions, @code{malloc}, @code{realloc}, and
|
|
@code{free}, respectively. You can override the default implementations by telling
|
|
flex that you will provide your own implementations.
|
|
|
|
To override the default implementations, you must do two things:
|
|
|
|
@enumerate
|
|
|
|
@item Suppress the default implementations by specifying one or more of the
|
|
following options:
|
|
|
|
@itemize
|
|
@opindex noyyalloc
|
|
@item @code{%option noyyalloc}
|
|
@item @code{%option noyyrealloc}
|
|
@item @code{%option noyyfree}
|
|
@end itemize
|
|
|
|
@item Provide your own implementation of the following functions: @footnote{It
|
|
is not necessary to override all (or any) of the memory management routines.
|
|
You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
|
|
@code{yyalloc}.}
|
|
|
|
@example
|
|
@verbatim
|
|
// For a non-reentrant scanner
|
|
void * yyalloc (size_t bytes);
|
|
void * yyrealloc (void * ptr, size_t bytes);
|
|
void yyfree (void * ptr);
|
|
|
|
// For a reentrant scanner
|
|
void * yyalloc (size_t bytes, void * yyscanner);
|
|
void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
|
|
void yyfree (void * ptr, void * yyscanner);
|
|
@end verbatim
|
|
@end example
|
|
|
|
@end enumerate
|
|
|
|
In the following example, we will override all three memory routines. We assume
|
|
that there is a custom allocator with garbage collection. In order to make this
|
|
example interesting, we will use a reentrant scanner, passing a pointer to the
|
|
custom allocator through @code{yyextra}.
|
|
|
|
@cindex overriding the memory routines
|
|
@example
|
|
@verbatim
|
|
%{
|
|
#include "some_allocator.h"
|
|
%}
|
|
|
|
/* Suppress the default implementations. */
|
|
%option noyyalloc noyyrealloc noyyfree
|
|
%option reentrant
|
|
|
|
/* Initialize the allocator. */
|
|
#define YY_EXTRA_TYPE struct allocator*
|
|
#define YY_USER_INIT yyextra = allocator_create();
|
|
|
|
%%
|
|
.|\n ;
|
|
%%
|
|
|
|
/* Provide our own implementations. */
|
|
void * yyalloc (size_t bytes, void* yyscanner) {
|
|
return allocator_alloc (yyextra, bytes);
|
|
}
|
|
|
|
void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
|
|
    return allocator_realloc (yyextra, ptr, bytes);
|
|
}
|
|
|
|
void yyfree (void * ptr, void * yyscanner) {
|
|
/* Do nothing -- we leave it to the garbage collector. */
|
|
}
|
|
|
|
@end verbatim
|
|
@end example
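
For comparison, here is a sketch of the same override for a non-reentrant
scanner built with @code{%option noyyalloc noyyrealloc noyyfree}. It uses the
non-reentrant signatures listed earlier in this section and counts the blocks
the scanner currently holds, which can help confirm that
@code{yylex_destroy} released everything. The counter and the
@code{scanner_live_blocks} accessor are hypothetical names, and the code
assumes the scanner never requests zero bytes.

@example
@verbatim
#include <stdlib.h>

/* Number of heap blocks currently held by the scanner. */
static long live_blocks = 0;

void *yyalloc(size_t bytes)
{
    void *p = malloc(bytes);

    if (p != NULL)
        live_blocks++;
    return p;
}

void *yyrealloc(void *ptr, size_t bytes)
{
    void *p = realloc(ptr, bytes);

    /* Growing an existing block does not change the count; only a
     * successful realloc of NULL creates a new block. */
    if (ptr == NULL && p != NULL)
        live_blocks++;
    return p;
}

void yyfree(void *ptr)
{
    if (ptr != NULL) {
        live_blocks--;
        free(ptr);
    }
}

long scanner_live_blocks(void)
{
    return live_blocks;
}
@end verbatim
@end example

If @code{scanner_live_blocks} is still non-zero after @code{yylex_destroy},
a likely cause is a buffer created with @code{yy_create_buffer} that was
never passed to @code{yy_delete_buffer}.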
|
|
|
|
|
|
@node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management
|
|
@section A Note About yytext And Memory
|
|
|
|
@cindex yytext, memory considerations
|
|
|
|
When flex finds a match, @code{yytext} points to the first character of the
|
|
match in the input buffer. The string itself is part of the input buffer, and
|
|
is @emph{NOT} allocated separately. The value of yytext will be overwritten the next
|
|
time yylex() is called. In short, the value of yytext is only valid from within
|
|
the matched rule's action.
|
|
|
|
Often, you want the value of yytext to persist for later processing, e.g., by a
parser with non-zero lookahead. In order to preserve yytext, you will have to
copy it with strdup() or a similar function. But this introduces some headache
because your parser is now responsible for freeing the copy of yytext. If you
use a yacc or bison parser (commonly used with flex), you will discover that
the error recovery mechanisms can cause memory to be leaked.
|
|
|
|
To prevent memory leaks from strdup'd yytext, you will have to track the memory
|
|
somehow. Our experience has shown that a garbage collection mechanism or a
|
|
pooled memory mechanism will save you a lot of grief when writing parsers.
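
One way to pool copies of yytext is sketched below. This is a deliberately
simple illustration rather than a recommended design. The names
@code{str_pool}, @code{pool_strndup}, and @code{pool_release} are
hypothetical. Every copy is linked into a single list, so all of them can be
released in one call regardless of what the parser's error recovery did with
the individual pointers.

@example
@verbatim
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct str_node {
    struct str_node *next;
    char text[];            /* NUL-terminated copy of the token */
};

struct str_pool {
    struct str_node *head;
};

/* Copies len bytes of text into the pool.  Returns the copy, or NULL
 * if memory is exhausted. */
char *pool_strndup(struct str_pool *pool, const char *text, size_t len)
{
    struct str_node *node;

    assert(pool != NULL && text != NULL);
    node = malloc(sizeof *node + len + 1);
    if (node == NULL)
        return NULL;
    memcpy(node->text, text, len);
    node->text[len] = '\0';
    node->next = pool->head;
    pool->head = node;
    return node->text;
}

/* Frees every string handed out by pool_strndup(). */
void pool_release(struct str_pool *pool)
{
    struct str_node *node = pool->head;

    while (node != NULL) {
        struct str_node *next = node->next;
        free(node);
        node = next;
    }
    pool->head = NULL;
}
@end verbatim
@end example

An action would call something like
@code{pool_strndup (&pool, yytext, yyleng)} and hand the result to the
parser. The pool is released once parsing has finished, whether or not it
succeeded.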
|
|
|
|
@node Serialized Tables, Diagnostics, Memory Management, Top
|
|
@chapter Serialized Tables
|
|
@cindex serialization
|
|
@cindex memory, serialized tables
|
|
|
|
@anchor{serialization}
|
|
A @code{flex} scanner has the ability to save the DFA tables to a file, and
|
|
load them at runtime when needed. The motivation for this feature is to reduce
|
|
the runtime memory footprint. Traditionally, these tables have been compiled into
|
|
the scanner as C arrays, and are sometimes quite large. Since the tables are
|
|
compiled into the scanner, the memory used by the tables can never be freed.
|
|
This is a waste of memory, especially if an application uses several scanners,
|
|
but none of them at the same time.
|
|
|
|
The serialization feature allows the tables to be loaded at runtime, before
|
|
scanning begins. The tables may be discarded when scanning is finished.
|
|
|
|
@menu
|
|
* Creating Serialized Tables::
|
|
* Loading and Unloading Serialized Tables::
|
|
* Tables File Format::
|
|
@end menu
|
|
|
|
@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables
|
|
@section Creating Serialized Tables
|
|
@cindex tables, creating serialized
|
|
@cindex serialization of tables
|
|
|
|
You may create a scanner with serialized tables by specifying:
|
|
|
|
@example
|
|
@verbatim
|
|
%option tables-file=FILE
|
|
or
|
|
--tables-file=FILE
|
|
@end verbatim
|
|
@end example
|
|
|
|
These options instruct flex to save the DFA tables to the file @var{FILE}. The tables
|
|
will @emph{not} be embedded in the generated scanner. The scanner will not
|
|
function on its own. The scanner will be dependent upon the serialized tables. You must
|
|
load the tables from this file at runtime before you can scan anything.
|
|
|
|
If you do not specify a filename to @code{--tables-file}, the tables will be
|
|
saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix.
|
|
|
|
If your project uses several different scanners, you can concatenate the
|
|
serialized tables into one file, and flex will find the correct set of tables,
|
|
using the scanner prefix as part of the lookup key. An example follows:
|
|
|
|
@cindex serialized tables, multiple scanners
|
|
@example
|
|
@verbatim
|
|
$ flex --tables-file --prefix=cpp cpp.l
|
|
$ flex --tables-file --prefix=c c.l
|
|
$ cat lex.cpp.tables lex.c.tables > all.tables
|
|
@end verbatim
|
|
@end example
|
|
|
|
The above example created two scanners, @samp{cpp}, and @samp{c}. Since we did
|
|
not specify a filename, the tables were serialized to @file{lex.c.tables} and
|
|
@file{lex.cpp.tables}, respectively. Then, we concatenated the two files
|
|
together into @file{all.tables}, which we will distribute with our project. At
|
|
runtime, we will open the file and tell flex to load the tables from it. Flex
|
|
will find the correct tables automatically. @xref{Loading and Unloading Serialized Tables}.
|
|
|
|
@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables
|
|
@section Loading and Unloading Serialized Tables
|
|
@cindex tables, loading and unloading
|
|
@cindex loading tables at runtime
|
|
@cindex tables, freeing
|
|
@cindex freeing tables
|
|
@cindex memory, serialized tables
|
|
|
|
If you've built your scanner with @code{%option tables-file}, then you must
|
|
load the scanner tables at runtime. This can be accomplished with the following
|
|
function:
|
|
|
|
@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}])
|
|
Locates scanner tables in the stream pointed to by @var{fp} and loads them.
|
|
Memory for the tables is allocated via @code{yyalloc}. You must call this
|
|
function before the first call to @code{yylex}. The argument @var{scanner}
|
|
only appears in the reentrant scanner.
|
|
This function returns @samp{0} (zero) on success, or non-zero on error.
|
|
@end deftypefun
|
|
|
|
The loaded tables are @strong{not} automatically destroyed (unloaded) when you
|
|
call @code{yylex_destroy}. The reason is that you may create several scanners
|
|
of the same type (in a reentrant scanner), each of which needs access to these
|
|
tables. To avoid a nasty memory leak, you must call the following function:
|
|
|
|
@deftypefun int yytables_destroy ([yyscan_t @var{scanner}])
|
|
Unloads the scanner tables. The tables must be loaded again before you can scan
|
|
any more data. The argument @var{scanner} only appears in the reentrant
|
|
scanner. This function returns @samp{0} (zero) on success, or non-zero on
|
|
error.
|
|
@end deftypefun
|
|
|
|
@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not
|
|
thread-safe.} You must ensure that these functions are called exactly once (for
|
|
each scanner type) in a threaded program, before any thread calls @code{yylex}.
|
|
After the tables are loaded, they are never written to, and no thread
|
|
protection is required thereafter -- until you destroy them.
|
|
|
|
@node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables
|
|
@section Tables File Format
|
|
@cindex tables, file format
|
|
@cindex file format, serialized tables
|
|
|
|
This section defines the file format of serialized @code{flex} tables.
|
|
|
|
The tables format allows for one or more sets of tables to be
|
|
specified, where each set corresponds to a given scanner. Scanners are
|
|
indexed by name, as described below. The file format is as follows:
|
|
|
|
@example
|
|
@verbatim
|
|
TABLE SET 1
|
|
+-------------------------------+
|
|
Header | uint32 th_magic; |
|
|
| uint32 th_hsize; |
|
|
| uint32 th_ssize; |
|
|
| uint16 th_flags; |
|
|
| char th_version[]; |
|
|
| char th_name[]; |
|
|
| uint8 th_pad64[]; |
|
|
+-------------------------------+
|
|
Table 1 | uint16 td_id; |
|
|
| uint16 td_flags; |
|
|
| uint32 td_hilen; |
|
|
| uint32 td_lolen; |
|
|
| void td_data[]; |
|
|
| uint8 td_pad64[]; |
|
|
+-------------------------------+
|
|
Table 2 | |
|
|
. . .
|
|
. . .
|
|
. . .
|
|
. . .
|
|
Table n | |
|
|
+-------------------------------+
|
|
TABLE SET 2
|
|
.
|
|
.
|
|
.
|
|
TABLE SET N
|
|
@end verbatim
|
|
@end example
|
|
|
|
The above diagram shows that a complete set of tables consists of a header
|
|
followed by multiple individual tables. Furthermore, multiple complete sets may
|
|
be present in the same file, each set with its own header and tables. The sets
|
|
are contiguous in the file. The only way to know if another set follows is to
|
|
check the next four bytes for the magic number (or check for EOF). The header
|
|
and tables sections are padded to 64-bit boundaries. Below we describe each
|
|
field in detail. This format does not specify how the scanner will expand the
|
|
given data, e.g., data may be serialized as int8, but expanded to an int32
|
|
array at runtime. This is to reduce the size of the serialized data where
|
|
possible. Remember, @emph{all integer values are in network byte order}.
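
The layout just described can be walked with very little code. The following
sketch counts the table sets in a file that has already been read into
memory. It is not the loader that @code{flex} generates, and the function
names are hypothetical. It relies only on the magic number and the
@code{th_ssize} field documented in this section.

@example
@verbatim
#include <stddef.h>
#include <stdint.h>

#define TABLES_MAGIC 0xF13C57B1u    /* value of th_magic */

/* Reads a 32-bit integer stored in network byte order. */
static uint32_t read_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Counts the table sets in buf[0..len).  Returns -1 if the data does
 * not look like a sequence of well-formed table sets. */
int count_table_sets(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int sets = 0;

    while (off < len) {
        uint32_t ssize;

        /* th_magic, th_hsize, and th_ssize fill the first 12 bytes. */
        if (len - off < 12 || read_u32_be(buf + off) != TABLES_MAGIC)
            return -1;
        ssize = read_u32_be(buf + off + 8);
        if (ssize < 12 || ssize > len - off)
            return -1;
        off += ssize;   /* th_ssize covers header, tables, and padding */
        sets++;
    }
    return sets;
}
@end verbatim
@end example

Because each set records its own total size, a reader can skip sets that
belong to other scanners without decoding their tables.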
|
|
|
|
@noindent
|
|
Fields of a table header:
|
|
|
|
@table @code
|
|
@item th_magic
|
|
Magic number, always 0xF13C57B1.
|
|
|
|
@item th_hsize
|
|
Size of this entire header, in bytes, including all fields plus any padding.
|
|
|
|
@item th_ssize
|
|
Size of this entire set, in bytes, including the header, all tables, plus
|
|
any padding.
|
|
|
|
@item th_flags
|
|
Bit flags for this table set. Currently unused.
|
|
|
|
@item th_version[]
|
|
Flex version in NULL-terminated string format, e.g., @samp{2.5.13a}. This is
|
|
the version of flex that was used to create the serialized tables.
|
|
|
|
@item th_name[]
|
|
Contains the name of this table set. The default is @samp{yytables},
|
|
and is prefixed accordingly, e.g., @samp{footables}. Must be NULL-terminated.
|
|
|
|
@item th_pad64[]
|
|
Zero or more NULL bytes, padding the entire header to the next 64-bit boundary
|
|
as calculated from the beginning of the header.
|
|
@end table
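
As an illustration of the header fields above, the sketch below decodes a
header from a byte buffer. It is a minimal example under stated assumptions,
not the code @code{flex} uses. The structure and function names are
hypothetical, and the 14-byte offset of @code{th_version} is derived from the
field widths shown in the diagram.

@example
@verbatim
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tables_header {
    uint32_t magic, hsize, ssize;
    uint16_t flags;
    const char *version;    /* points into the caller's buffer */
    const char *name;       /* points into the caller's buffer */
};

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if the header is truncated or inconsistent. */
int parse_tables_header(const unsigned char *buf, size_t len,
                        struct tables_header *h)
{
    const unsigned char *nul;
    size_t off = 14;        /* fixed fields: 4 + 4 + 4 + 2 bytes */

    if (len < off)
        return -1;
    h->magic = be32(buf);
    h->hsize = be32(buf + 4);
    h->ssize = be32(buf + 8);
    h->flags = be16(buf + 12);
    if (h->magic != 0xF13C57B1u || h->hsize > len ||
        h->hsize < off || h->hsize % 8 != 0)
        return -1;

    /* th_version: a terminated string inside the header. */
    nul = memchr(buf + off, '\0', h->hsize - off);
    if (nul == NULL)
        return -1;
    h->version = (const char *)(buf + off);
    off = (size_t)(nul - buf) + 1;

    /* th_name: a terminated string following th_version. */
    if (off >= h->hsize)
        return -1;
    nul = memchr(buf + off, '\0', h->hsize - off);
    if (nul == NULL)
        return -1;
    h->name = (const char *)(buf + off);
    return 0;
}
@end verbatim
@end example

The check that @code{th_hsize} is a multiple of eight follows from the
padding rule for @code{th_pad64}. The first table of the set begins
@code{th_hsize} bytes after the start of the header.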
|
|
|
|
@noindent
|
|
Fields of a table:
|
|
|
|
@table @code
|
|
@item td_id
|
|
Specifies the table identifier. Possible values are:
|
|
@table @code
|
|
@item YYTD_ID_ACCEPT (0x01)
|
|
@code{yy_accept}
|
|
@item YYTD_ID_BASE (0x02)
|
|
@code{yy_base}
|
|
@item YYTD_ID_CHK (0x03)
|
|
@code{yy_chk}
|
|
@item YYTD_ID_DEF (0x04)
|
|
@code{yy_def}
|
|
@item YYTD_ID_EC (0x05)
|
|
@code{yy_ec}
|
|
@item YYTD_ID_META (0x06)
|
|
@code{yy_meta}
|
|
@item YYTD_ID_NUL_TRANS (0x07)
|
|
@code{yy_NUL_trans}
|
|
@item YYTD_ID_NXT (0x08)
|
|
@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen}
|
|
field below.
|
|
@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09)
|
|
@code{yy_rule_can_match_eol}
|
|
@item YYTD_ID_START_STATE_LIST (0x0A)
|
|
@code{yy_start_state_list}. This array is handled specially because it is an
|
|
array of pointers to structs. See the @code{td_flags} field below.
|
|
@item YYTD_ID_TRANSITION (0x0B)
|
|
@code{yy_transition}. This array is handled specially because it is an array of
|
|
structs. See the @code{td_lolen} field below.
|
|
@item YYTD_ID_ACCLIST (0x0C)
|
|
@code{yy_acclist}
|
|
@end table
|
|
|
|
@item td_flags
|
|
Bit flags describing how to interpret the data in @code{td_data}.
|
|
The data arrays are one-dimensional by default, but may be
|
|
two dimensional as specified in the @code{td_hilen} field.
|
|
|
|
@table @code
|
|
@item YYTD_DATA8 (0x01)
|
|
The data is serialized as an array of type int8.
|
|
@item YYTD_DATA16 (0x02)
|
|
The data is serialized as an array of type int16.
|
|
@item YYTD_DATA32 (0x04)
|
|
The data is serialized as an array of type int32.
|
|
@item YYTD_PTRANS (0x08)
|
|
The data is a list of indexes of entries in the expanded @code{yy_transition}
|
|
array. Each index should be expanded to a pointer to the corresponding entry
|
|
in the @code{yy_transition} array. We count on the fact that the
|
|
@code{yy_transition} array has already been seen.
|
|
@item YYTD_STRUCT (0x10)
|
|
The data is a list of yy_trans_info structs, each of which consists of
|
|
two integers. There is no padding between struct elements or between structs.
|
|
The type of each member is determined by the @code{YYTD_DATA*} bits.
|
|
@end table
|
|
|
|
@item td_hilen
|
|
If @code{td_hilen} is non-zero, then the data is a two-dimensional array.
|
|
Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the
|
|
number of elements in the higher dimensional array, and @code{td_lolen} contains
|
|
the number of elements in the lowest dimension.
|
|
|
|
Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or
|
|
@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified
|
|
by the @code{td_flags} field. It is possible for both @code{td_lolen} and
|
|
@code{td_hilen} to be zero, in which case @code{td_data} is a zero length
|
|
array, and no data is loaded, i.e., this table is simply skipped. Flex does not
|
|
currently generate tables of zero length.
|
|
|
|
@item td_lolen
|
|
Specifies the number of elements in the lowest dimension array. If this is
|
|
a one-dimensional array, then it is simply the number of elements in this array.
|
|
The element size is determined by the @code{td_flags} field.
|
|
|
|
@item td_data[]
|
|
The table data. This array may be a one- or two-dimensional array, of type
|
|
@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or
|
|
@code{struct yy_trans_info*}, depending upon the values in the
|
|
@code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields.
|
|
|
|
@item td_pad64[]
|
|
Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as
|
|
calculated from the beginning of this table.
|
|
@end table
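
The relationship between @code{td_flags}, @code{td_hilen}, @code{td_lolen},
and the size of a serialized table can be expressed as a short calculation.
The sketch below is one reading of the field descriptions above, not code
taken from @code{flex}. The function names are hypothetical. It assumes that
for @code{YYTD_STRUCT} data the lengths count structs, each holding two
integers, and it does not guard against arithmetic overflow for very large
lengths.

@example
@verbatim
#include <stdint.h>

enum {
    YYTD_DATA8  = 0x01,
    YYTD_DATA16 = 0x02,
    YYTD_DATA32 = 0x04,
    YYTD_PTRANS = 0x08,
    YYTD_STRUCT = 0x10
};

/* Size in bytes of one serialized integer, or 0 if no width is set. */
static uint32_t element_size(uint16_t td_flags)
{
    if (td_flags & YYTD_DATA8)
        return 1;
    if (td_flags & YYTD_DATA16)
        return 2;
    if (td_flags & YYTD_DATA32)
        return 4;
    return 0;
}

/* Bytes occupied by td_data[], before padding. */
uint64_t table_data_bytes(uint16_t td_flags, uint32_t td_hilen,
                          uint32_t td_lolen)
{
    uint64_t count = td_lolen;

    if (td_hilen != 0)
        count *= td_hilen;      /* two-dimensional array */
    if (td_flags & YYTD_STRUCT)
        count *= 2;             /* each struct holds two integers */
    return count * element_size(td_flags);
}

/* Total bytes of the table: 12 bytes of fixed fields, the data, and
 * td_pad64[] rounding the total up to a 64-bit boundary. */
uint64_t table_total_bytes(uint16_t td_flags, uint32_t td_hilen,
                           uint32_t td_lolen)
{
    uint64_t n = 12 + table_data_bytes(td_flags, td_hilen, td_lolen);

    return (n + 7) & ~(uint64_t)7;
}
@end verbatim
@end example

For example, a one-dimensional table of ten @code{int16} values carries 20
bytes of data and occupies 32 bytes in the file once the fixed fields and
padding are included.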
|
|
|
|
@node Diagnostics, Limitations, Serialized Tables, Top
|
|
@chapter Diagnostics
|
|
|
|
@cindex error reporting, diagnostic messages
|
|
@cindex warnings, diagnostic messages
|
|
|
|
The following is a list of @code{flex} diagnostic messages:
|
|
|
|
@itemize
|
|
@item
|
|
@samp{warning, rule cannot be matched} indicates that the given rule
|
|
cannot be matched because it follows other rules that will always match
|
|
the same text as it. For example, in the following @samp{foo} cannot be
|
|
matched because it comes after an identifier ``catch-all'' rule:
|
|
|
|
@cindex warning, rule cannot be matched
|
|
@example
|
|
@verbatim
|
|
[a-z]+ got_identifier();
|
|
foo got_foo();
|
|
@end verbatim
|
|
@end example
|
|
|
|
Using @code{REJECT} in a scanner suppresses this warning.
|
|
|
|
@item
|
|
@samp{warning, -s option given but default rule can be matched} means
|
|
that it is possible (perhaps only in a particular start condition) that
|
|
the default rule (match any single character) is the only one that will
|
|
match a particular input. Since @samp{-s} was given, presumably this is
|
|
not intended.
|
|
|
|
@item
|
|
@code{reject_used_but_not_detected undefined} or
|
|
@code{yymore_used_but_not_detected undefined}. These errors can occur
|
|
at compile time. They indicate that the scanner uses @code{REJECT} or
|
|
@code{yymore()} but that @code{flex} failed to notice the fact, meaning
|
|
that @code{flex} scanned the first two sections looking for occurrences
|
|
of these actions and failed to find any, but somehow you snuck some in
|
|
(via a #include file, for example). Use @code{%option reject} or
|
|
@code{%option yymore} to indicate to @code{flex} that you really do use
|
|
these features.
|
|
|
|
@item
|
|
@samp{flex scanner jammed}. A scanner compiled with
|
|
@samp{-s} has encountered an input string which wasn't matched by any of
|
|
its rules. This error can also occur due to internal problems.
|
|
|
|
@item
|
|
@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array}
|
|
and one of its rules matched a string longer than the @code{YYLMAX}
|
|
constant (8K bytes by default). You can increase the value by
|
|
defining @code{YYLMAX} with @code{#define} in the definitions section of your @code{flex}
|
|
input.
|
|
|
|
@item
|
|
@samp{scanner requires -8 flag to use the character 'x'}. Your scanner
|
|
specification includes recognizing the 8-bit character @samp{'x'} and
|
|
you did not specify the @samp{-8} flag, and your scanner defaulted to 7-bit
|
|
because you used the @samp{-Cf} or @samp{-CF} table compression options.
|
|
See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
|
|
details.
|
|
|
|
@item
|
|
@samp{flex scanner push-back overflow}. You used @code{unput()} to push
|
|
back so much text that the scanner's buffer could not hold both the
|
|
pushed-back text and the current token in @code{yytext}. Ideally the
|
|
scanner should dynamically resize the buffer in this case, but at
|
|
present it does not.
|
|
|
|
@item
|
|
@samp{input buffer overflow, can't enlarge buffer because scanner uses
|
|
REJECT}. The scanner was working on matching an extremely large token
|
|
and needed to expand the input buffer. This doesn't work with scanners
|
|
that use @code{REJECT}.
|
|
|
|
@item
|
|
@samp{fatal flex scanner internal error--end of buffer missed}. This can
|
|
occur in a scanner which is reentered after a long-jump has jumped out of
|
|
(or over) the scanner's activation frame. Before reentering the
|
|
scanner, use:
|
|
@example
|
|
@verbatim
|
|
yyrestart( yyin );
|
|
@end verbatim
|
|
@end example
|
|
or, as noted above, switch to using the C++ scanner class.
|
|
|
|
@item
|
|
@samp{too many start conditions in <> construct!} You listed more start
|
|
conditions in a <> construct than exist (so you must have listed at
|
|
least one of them twice).
|
|
@end itemize
|
|
|
|
@node Limitations, Bibliography, Diagnostics, Top
|
|
@chapter Limitations
|
|
|
|
@cindex limitations of flex
|
|
|
|
Some trailing context patterns cannot be properly matched and generate
|
|
warning messages (@samp{dangerous trailing context}). These are
|
|
patterns where the ending of the first part of the rule matches the
|
|
beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*}
|
|
matches the @samp{x} at the beginning of the trailing context. (Note that
|
|
the POSIX draft states that the text matched by such patterns is
|
|
undefined.) For some trailing context rules, parts which are actually
|
|
fixed-length are not recognized as such, leading to the abovementioned
|
|
performance loss. In particular, parts using @samp{|} or @samp{@{n@}}
|
|
(such as @samp{foo@{3@}}) are always considered variable-length.
|
|
Combining trailing context with the special @samp{|} action can result
|
|
in @emph{fixed} trailing context being turned into the more expensive
|
|
@emph{variable} trailing context. For example, this happens in the following:
|
|
|
|
@cindex warning, dangerous trailing context
|
|
@example
|
|
@verbatim
|
|
%%
|
|
abc |
|
|
xyz/def
|
|
@end verbatim
|
|
@end example
|
|
|
|
Use of @code{unput()} invalidates @code{yytext} and @code{yyleng}, unless the
|
|
@code{%array} directive or the @samp{-l} option has been used.
|
|
Pattern-matching of @code{NUL}s is substantially slower than matching
|
|
other characters. Dynamic resizing of the input buffer is slow, as it
|
|
entails rescanning all the text matched so far by the current (generally
|
|
huge) token. Due to both buffering of input and read-ahead, you cannot
|
|
intermix calls to @file{<stdio.h>} routines, such as @code{getchar()},
|
|
with @code{flex} rules and expect it to work. Call @code{input()}
|
|
instead. The total number of table entries listed by the @samp{-v} flag excludes
|
|
the number of table entries needed to determine what rule has been
|
|
matched. The number of entries is equal to the number of DFA states if
|
|
the scanner does not use @code{REJECT}, and somewhat greater than the
|
|
number of states if it does. @code{REJECT} cannot be used with the
|
|
@samp{-f} or @samp{-F} options.
|
|
|
|
The @code{flex} internal algorithms need documentation.
|
|
|
|
@node Bibliography, FAQ, Limitations, Top
|
|
@chapter Additional Reading
|
|
|
|
You may wish to read more about the following programs:
|
|
@itemize
|
|
@item lex
|
|
@item yacc
|
|
@item sed
|
|
@item awk
|
|
@end itemize
|
|
|
|
The following books may contain material of interest:
|
|
|
|
John Levine, Tony Mason, and Doug Brown,
|
|
@emph{Lex & Yacc},
|
|
O'Reilly and Associates. Be sure to get the 2nd edition.
|
|
|
|
M. E. Lesk and E. Schmidt,
|
|
@emph{Lex -- A Lexical Analyzer Generator}
|
|
|
|
Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles,
|
|
Techniques and Tools}, Addison-Wesley (1986). Describes the
|
|
pattern-matching techniques used by @code{flex} (deterministic finite
|
|
automata).
|
|
|
|
@node FAQ, Appendices, Bibliography, Top
|
|
@unnumbered FAQ
|
|
|
|
From time to time, the @code{flex} maintainer receives certain
|
|
questions. Rather than repeat answers to well-understood problems, we
|
|
publish them here.
|
|
|
|
@menu
|
|
* When was flex born?::
|
|
* How do I expand backslash-escape sequences in C-style quoted strings?::
|
|
* Why do flex scanners call fileno if it is not ANSI compatible?::
|
|
* Does flex support recursive pattern definitions?::
|
|
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
|
|
* Flex is not matching my patterns in the same order that I defined them.::
|
|
* My actions are executing out of order or sometimes not at all.::
|
|
* How can I have multiple input sources feed into the same scanner at the same time?::
|
|
* Can I build nested parsers that work with the same input file?::
|
|
* How can I match text only at the end of a file?::
|
|
* How can I make REJECT cascade across start condition boundaries?::
|
|
* Why cant I use fast or full tables with interactive mode?::
|
|
* How much faster is -F or -f than -C?::
|
|
* If I have a simple grammar cant I just parse it with flex?::
|
|
* Why doesn't yyrestart() set the start state back to INITIAL?::
|
|
* How can I match C-style comments?::
|
|
* The period isn't working the way I expected.::
|
|
* Can I get the flex manual in another format?::
|
|
* Does there exist a "faster" NDFA->DFA algorithm?::
|
|
* How does flex compile the DFA so quickly?::
|
|
* How can I use more than 8192 rules?::
|
|
* How do I abandon a file in the middle of a scan and switch to a new file?::
|
|
* How do I execute code only during initialization (only before the first scan)?::
|
|
* How do I execute code at termination?::
|
|
* Where else can I find help?::
|
|
* Can I include comments in the "rules" section of the file?::
|
|
* I get an error about undefined yywrap().::
|
|
* How can I change the matching pattern at run time?::
|
|
* How can I expand macros in the input?::
|
|
* How can I build a two-pass scanner?::
|
|
* How do I match any string not matched in the preceding rules?::
|
|
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
|
|
* Is there a way to make flex treat NULL like a regular character?::
|
|
* Whenever flex can not match the input it says "flex scanner jammed".::
|
|
* Why doesn't flex have non-greedy operators like perl does?::
|
|
* Memory leak - 16386 bytes allocated by malloc.::
|
|
* How do I track the byte offset for lseek()?::
|
|
* How do I use my own I/O classes in a C++ scanner?::
|
|
* How do I skip as many chars as possible?::
|
|
* deleteme00::
|
|
* Are certain equivalent patterns faster than others?::
|
|
* Is backing up a big deal?::
|
|
* Can I fake multi-byte character support?::
|
|
* deleteme01::
|
|
* Can you discuss some flex internals?::
|
|
* unput() messes up yy_at_bol::
|
|
* The | operator is not doing what I want::
|
|
* Why can't flex understand this variable trailing context pattern?::
|
|
* The ^ operator isn't working::
|
|
* Trailing context is getting confused with trailing optional patterns::
|
|
* Is flex GNU or not?::
|
|
* ERASEME53::
|
|
* I need to scan if-then-else blocks and while loops::
|
|
* ERASEME55::
|
|
* ERASEME56::
|
|
* ERASEME57::
|
|
* Is there a repository for flex scanners?::
|
|
* How can I conditionally compile or preprocess my flex input file?::
|
|
* Where can I find grammars for lex and yacc?::
|
|
* I get an end-of-buffer message for each character scanned.::
|
|
* unnamed-faq-62::
|
|
* unnamed-faq-63::
|
|
* unnamed-faq-64::
|
|
* unnamed-faq-65::
|
|
* unnamed-faq-66::
|
|
* unnamed-faq-67::
|
|
* unnamed-faq-68::
|
|
* unnamed-faq-69::
|
|
* unnamed-faq-70::
|
|
* unnamed-faq-71::
|
|
* unnamed-faq-72::
|
|
* unnamed-faq-73::
|
|
* unnamed-faq-74::
|
|
* unnamed-faq-75::
|
|
* unnamed-faq-76::
|
|
* unnamed-faq-77::
|
|
* unnamed-faq-78::
|
|
* unnamed-faq-79::
|
|
* unnamed-faq-80::
|
|
* unnamed-faq-81::
|
|
* unnamed-faq-82::
|
|
* unnamed-faq-83::
|
|
* unnamed-faq-84::
|
|
* unnamed-faq-85::
|
|
* unnamed-faq-86::
|
|
* unnamed-faq-87::
|
|
* unnamed-faq-88::
|
|
* unnamed-faq-90::
|
|
* unnamed-faq-91::
|
|
* unnamed-faq-92::
|
|
* unnamed-faq-93::
|
|
* unnamed-faq-94::
|
|
* unnamed-faq-95::
|
|
* unnamed-faq-96::
|
|
* unnamed-faq-97::
|
|
* unnamed-faq-98::
|
|
* unnamed-faq-99::
|
|
* unnamed-faq-100::
|
|
* unnamed-faq-101::
|
|
* What is the difference between YYLEX_PARAM and YY_DECL?::
|
|
* Why do I get "conflicting types for yylex" error?::
|
|
* How do I access the values set in a Flex action from within a Bison action?::
|
|
@end menu
|
|
|
|
@node When was flex born?
|
|
@unnumberedsec When was flex born?
|
|
|
|
Vern Paxson took over
|
|
the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it
|
|
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
|
|
a legend was born :-).
|
|
|
|
@node How do I expand backslash-escape sequences in C-style quoted strings?
|
|
@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?
|
|
|
|
A key point when scanning quoted strings is that you cannot (easily) write
|
|
a single rule that will precisely match the string if you allow things
|
|
like embedded escape sequences and newlines. If you try to match strings
|
|
with a single rule then you'll wind up having to rescan the string anyway
|
|
to find any escape sequences.
|
|
|
|
Instead you can use exclusive start conditions and a set of rules, one for
|
|
matching non-escaped text, one for matching a single escape, one for
|
|
matching an embedded newline, and one for recognizing the end of the
|
|
string. Each of these rules is then faced with the question of where to
|
|
put its intermediary results. The best solution is for the rules to
|
|
append their local value of @code{yytext} to the end of a ``string literal''
|
|
buffer. A rule like the escape-matcher will append to the buffer the
|
|
meaning of the escape sequence rather than the literal text in @code{yytext}.
|
|
In this way, @code{yytext} does not need to be modified at all.
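
The bookkeeping described above might look like the following C sketch,
which could live in the definitions section of the scanner. It is only an
illustration: the names @code{str_begin}, @code{str_append},
@code{str_append_escape} and @code{str_value}, the fixed capacity
@code{STR_BUF_MAX}, and the small set of escapes handled are all
assumptions made for this example, not part of @code{flex}.

@example
@verbatim
#include <stddef.h>
#include <string.h>

#define STR_BUF_MAX 1024            /* hypothetical fixed capacity */

static char   str_buf[STR_BUF_MAX]; /* the "string literal" buffer */
static size_t str_len = 0;

/* Start a new string literal (call when the opening quote is seen). */
void str_begin(void)
{
    str_len = 0;
    str_buf[0] = '\0';
}

/* Append len bytes of literal text, e.g. str_append(yytext, yyleng).
 * Returns 0 on success, -1 if the text does not fit. */
int str_append(const char *text, size_t len)
{
    if (len >= STR_BUF_MAX - str_len)
        return -1;
    memcpy(str_buf + str_len, text, len);
    str_len += len;
    str_buf[str_len] = '\0';
    return 0;
}

/* Append the meaning of a two-character escape such as "\n"; esc
 * points at the backslash.  Unknown escapes yield the escaped
 * character itself, which also covers \\ and \". */
int str_append_escape(const char *esc)
{
    char c;

    switch (esc[1]) {
    case 'n': c = '\n'; break;
    case 't': c = '\t'; break;
    case 'r': c = '\r'; break;
    default:  c = esc[1]; break;
    }
    return str_append(&c, 1);
}

/* The text collected so far, NUL-terminated. */
const char *str_value(void)
{
    return str_buf;
}
@end verbatim
@end example

The rule for non-escaped text would call @code{str_append(yytext, yyleng)},
the escape rule would call @code{str_append_escape(yytext)}, and the rule
for the closing quote would hand @code{str_value()} to the caller.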
|
|
|
|
@node Why do flex scanners call fileno if it is not ANSI compatible?
|
|
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?
|
|
|
|
Flex scanners call @code{fileno()} in order to get the file descriptor
|
|
corresponding to @code{yyin}. The file descriptor may be passed to
|
|
@code{isatty()} or @code{read()}, depending upon which @code{%options} you specified.
|
|
If your system does not have @code{fileno()} support, to get rid of the
|
|
@code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()}
|
|
call, you must specify one of @code{%option always-interactive} or
|
|
@code{%option never-interactive}.
|
|
|
|
@node Does flex support recursive pattern definitions?
|
|
@unnumberedsec Does flex support recursive pattern definitions?
|
|
|
|
e.g.,
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
block "{"({block}|{statement})*"}"
|
|
@end verbatim
|
|
@end example
|
|
|
|
No. You cannot have recursive definitions. The pattern-matching power of
|
|
regular expressions in general (and therefore flex scanners, too) is
|
|
limited. In particular, regular expressions cannot ``balance'' parentheses
|
|
to an arbitrary degree. For example, it's impossible to write a regular
|
|
expression that matches all strings containing the same number of '@{'s
|
|
as '@}'s. For more powerful pattern matching, you need a parser, such
|
|
as @cite{GNU bison}.
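
To see what is missing, note that balancing needs a counter that can grow
without bound, which a fixed set of DFA states cannot provide. The
following C sketch checks that every @samp{@{} is closed by a later
@samp{@}}; the function name @code{braces_balanced} is made up for this
example, and a real parser does considerably more than count.

@example
@verbatim
/* Returns 1 if every '{' in s is closed by a later '}', else 0. */
int braces_balanced(const char *s)
{
    long depth = 0;   /* number of currently open braces */

    for (; *s != '\0'; s++) {
        if (*s == '{')
            depth++;
        else if (*s == '}' && --depth < 0)
            return 0; /* closing brace with nothing open */
    }
    return depth == 0;
}
@end verbatim
@end example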
|
|
|
|
@node How do I skip huge chunks of input (tens of megabytes) while using flex?
|
|
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?
|
|
|
|
Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.
|
|
|
|
@node Flex is not matching my patterns in the same order that I defined them.
|
|
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.
|
|
|
|
@code{flex} picks the
|
|
rule that matches the most text (i.e., the longest possible input string).
|
|
This is because @code{flex} uses a matching technique
|
|
(``deterministic finite automata'') that actually does all of the matching
|
|
simultaneously, in parallel. (Seems impossible, but it's actually a fairly
|
|
simple technique once you understand the principles.)
|
|
|
|
A side-effect of this parallel matching is that when the input matches more
|
|
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
|
|
is explained further in the manual, in the section @ref{Matching}.
|
|
|
|
If you want @code{flex} to choose a shorter match, then you can work around this
|
|
behavior by expanding your short
|
|
rule to match more text, then put back the extra:
|
|
|
|
@example
|
|
@verbatim
|
|
data_.* yyless( 5 ); BEGIN BLOCKIDSTATE;
|
|
@end verbatim
|
|
@end example
|
|
|
|
Another fix would be to make the second rule active only during the
|
|
@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
|
|
by declaring it with @code{%x} instead of @code{%s}.
|
|
|
|
A final fix is to change the input language so that the ambiguity for
|
|
@samp{data_} is removed, by adding characters to it that don't match the
|
|
identifier rule, or by removing characters (such as @samp{_}) from the
|
|
identifier rule so it no longer matches @samp{data_}. (Of course, you might
|
|
also not have the option of changing the input language.)
|
|
|
|
@node My actions are executing out of order or sometimes not at all.
|
|
@unnumberedsec My actions are executing out of order or sometimes not at all.
|
|
|
|
Most likely, you have (in error) placed the opening @samp{@{} of the action
|
|
block on a different line than the rule, e.g.,
|
|
|
|
@example
|
|
@verbatim
|
|
^(foo|bar)
|
|
{ <<<--- WRONG!
|
|
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@code{flex} requires that the opening @samp{@{} of an action associated with a rule
|
|
begin on the same line as does the rule. You need instead to write your rules
|
|
as follows:
|
|
|
|
@example
|
|
@verbatim
|
|
^(foo|bar) { // CORRECT!
|
|
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node How can I have multiple input sources feed into the same scanner at the same time?
|
|
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?
|
|
|
|
If @dots{}
|
|
@itemize
|
|
@item
|
|
your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
|
|
@item
|
|
AND you run your scanner interactively (@samp{-I} option; default unless using special table
|
|
compression options),
|
|
@item
|
|
AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
|
|
@end itemize
|
|
|
|
then every time it matches a token, it will have exhausted its input
|
|
buffer (because the scanner is free of backtracking). This means you
|
|
can safely use @code{select()} at that point and only call @code{yylex()} for another
|
|
token if @code{select()} indicates there's data available.
|
|
|
|
That is, move the @code{select()} out from the input function to a point where
|
|
it determines whether @code{yylex()} gets called for the next token.
|
|
|
|
With this approach, you will still have problems if your input can arrive
|
|
piecemeal; @code{select()} could inform you that the beginning of a token is
|
|
available, you call @code{yylex()} to get it, but it winds up blocking waiting
|
|
for the later characters in the token.
|
|
|
|
Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That
|
|
is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is
|
|
available. If input is available for the scanner, it reads and returns the
|
|
next byte. If input is available from another source, it calls whatever
|
|
function is responsible for reading from that source. (If no input is
|
|
available, it blocks until some input is available.) I've used this technique in an
|
|
interpreter I wrote that both reads keyboard input using a @code{flex} scanner and
|
|
IPC traffic from sockets, and it works fine.
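
The second approach can be sketched in C roughly as follows. This is a
minimal illustration, not the code from the interpreter mentioned above:
the function name @code{read_scanner_byte}, its parameters, and the
@code{handle_other} callback are assumptions made for the example, and it
relies on the POSIX @code{select()} and @code{read()} calls.

@example
@verbatim
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait until scanner_fd is readable, servicing other_fd through
 * handle_other() whenever that source has input.  handle_other()
 * must consume the pending data, or the loop will spin.
 * Stores one byte in *out.  Returns 1 on success, 0 at end of
 * input, and -1 on error (including an interrupted select()). */
int read_scanner_byte(int scanner_fd, int other_fd,
                      void (*handle_other)(int fd), char *out)
{
    for (;;) {
        fd_set readable;
        int maxfd = scanner_fd > other_fd ? scanner_fd : other_fd;

        FD_ZERO(&readable);
        FD_SET(scanner_fd, &readable);
        FD_SET(other_fd, &readable);

        /* Blocks until at least one source has input. */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            return -1;

        if (FD_ISSET(other_fd, &readable))
            handle_other(other_fd);

        if (FD_ISSET(scanner_fd, &readable)) {
            ssize_t n = read(scanner_fd, out, 1);
            return n < 0 ? -1 : (int)n;
        }
    }
}
@end verbatim
@end example

A redefined @code{YY_INPUT} could call such a helper with
@code{fileno(yyin)} and set its result argument to 1 when a byte was read,
or to @code{YY_NULL} at end of input.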
|
|
|
|
@node Can I build nested parsers that work with the same input file?
|
|
@unnumberedsec Can I build nested parsers that work with the same input file?
|
|
|
|
This is not going to work without some additional effort. The reason is
|
|
that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the
|
|
``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
|
|
of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't
|
|
see that input. You might be tempted to work around this problem by
|
|
redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
|
|
that that approach is quite difficult. Instead, the best solution is to
|
|
combine all of your scanners into one large scanner, using a different
|
|
exclusive start condition for each.
|
|
|
|
@node How can I match text only at the end of a file?
|
|
@unnumberedsec How can I match text only at the end of a file?
|
|
|
|
There is no way to write a rule which is ``match this text, but only if
|
|
it comes at the end of the file''. You can fake it, though, if you happen
|
|
to have a character lying around that you don't allow in your input.
|
|
Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
|
|
an @samp{EOF}, returns the magic character first (and remembers to return a
|
|
real @code{EOF} next time it's called). Then you could write:
|
|
|
|
@example
|
|
@verbatim
|
|
<COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */
|
|
@end verbatim
|
|
@end example
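
The input routine itself might look like the following C sketch. The
function name @code{read_with_eof_marker} and the choice of @samp{\001}
as the magic character are assumptions made for this example; the rule
above would additionally need a matching @code{EOF_CHAR} name definition
in the definitions section.

@example
@verbatim
#include <stdio.h>

#define EOF_CHAR '\001'   /* assumed never to occur in real input */

/* Fills buf with up to max_size bytes from fp and returns the count.
 * At end of file it first delivers EOF_CHAR once, and only on the
 * following call reports end of input by returning 0. */
size_t read_with_eof_marker(FILE *fp, char *buf, size_t max_size)
{
    static int marker_sent = 0;
    size_t n;

    if (max_size == 0)
        return 0;

    n = fread(buf, 1, max_size, fp);
    if (n > 0)
        return n;

    if (!marker_sent) {
        marker_sent = 1;      /* remember to report EOF next time */
        buf[0] = EOF_CHAR;
        return 1;
    }
    marker_sent = 0;          /* ready for the next input file */
    return 0;
}
@end verbatim
@end example

A redefined @code{YY_INPUT} would store the return value in its result
argument, using @code{YY_NULL} when the routine returns 0.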
|
|
|
|
@node How can I make REJECT cascade across start condition boundaries?
|
|
@unnumberedsec How can I make REJECT cascade across start condition boundaries?
|
|
|
|
You can do this as follows. Suppose you have a start condition @samp{A}, and
|
|
after exhausting all of the possible matches in @samp{<A>}, you want to try
|
|
matches in @samp{<INITIAL>}. Then you could use the following:
|
|
|
|
@example
|
|
@verbatim
|
|
%x A
|
|
%%
|
|
<A>rule_that_is_long ...; REJECT;
|
|
<A>rule ...; REJECT; /* shorter rule */
|
|
<A>etc.
|
|
...
|
|
<A>.|\n {
|
|
/* Shortest and last rule in <A>, so
|
|
* cascaded REJECTs will eventually
|
|
* wind up matching this rule. We want
|
|
* to now switch to the initial state
|
|
* and try matching from there instead.
|
|
*/
|
|
yyless(0); /* put back matched text */
|
|
BEGIN(INITIAL);
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node Why cant I use fast or full tables with interactive mode?
|
|
@unnumberedsec Why can't I use fast or full tables with interactive mode?
|
|
|
|
One of the assumptions
|
|
flex makes is that interactive applications are inherently slow (they're
|
|
waiting on a human after all).
|
|
The restriction has to do with how the scanner detects that it must be finished scanning
|
|
a token. For interactive scanners, after scanning each character the current
|
|
state is looked up in a table (essentially) to see whether there's a chance
|
|
of another input character possibly extending the length of the match. If
|
|
not, the scanner halts. For non-interactive scanners, the end-of-token test
|
|
is much simpler, basically a compare with 0, so no memory bus cycles. Since
|
|
the test occurs in the innermost scanning loop, one would like to make it go
|
|
as fast as possible.
|
|
|
|
Still, it seems reasonable to allow the user to choose to trade off a bit
|
|
of performance in this area to gain the corresponding flexibility. There
|
|
might be another reason, though, why fast scanners don't support the
|
|
interactive option.
|
|
|
|
@node How much faster is -F or -f than -C?
|
|
@unnumberedsec How much faster is -F or -f than -C?
|
|
|
|
Much faster (factor of 2-3).
|
|
|
|
@node If I have a simple grammar cant I just parse it with flex?
|
|
@unnumberedsec If I have a simple grammar can't I just parse it with flex?
|
|
|
|
Is your grammar recursive? That's almost always a sign that you're
|
|
better off using a parser/scanner rather than just trying to use a scanner
|
|
alone.
|
|
|
|
@node Why doesn't yyrestart() set the start state back to INITIAL?
|
|
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?
|
|
|
|
There are two reasons. The first is that there might
|
|
be programs that rely on the start state not changing across file changes.
|
|
The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
|
|
so fixing the problem there doesn't solve the more general problem.
|
|
|
|
@node How can I match C-style comments?
|
|
@unnumberedsec How can I match C-style comments?
|
|
|
|
You might be tempted to try something like this:
|
|
|
|
@example
|
|
@verbatim
|
|
"/*".*"*/" // WRONG!
|
|
@end verbatim
|
|
@end example
|
|
|
|
or, worse, this:
|
|
|
|
@example
|
|
@verbatim
"/*"(.|\n)*"*/"      // WRONG!
|
|
@end verbatim
|
|
@end example
|
|
|
|
The above rules will eat too much input, and blow up on things like:
|
|
|
|
@example
|
|
@verbatim
|
|
/* a comment */ do_my_thing( "oops */" );
|
|
@end verbatim
|
|
@end example
|
|
|
|
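Here is one way which allows you to track line information (it assumes that @code{IN_COMMENT} has been declared as an exclusive start condition with @code{%x}):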
|
|
|
|
@example
|
|
@verbatim
|
|
<INITIAL>{
|
|
"/*" BEGIN(IN_COMMENT);
|
|
}
|
|
<IN_COMMENT>{
|
|
"*/" BEGIN(INITIAL);
|
|
[^*\n]+ // eat comment in chunks
|
|
"*" // eat the lone star
|
|
\n yylineno++;
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node The period isn't working the way I expected.
|
|
@unnumberedsec The '.' isn't working the way I expected.
|
|
|
|
Here are some tips for using @samp{.}:
|
|
|
|
@itemize
|
|
@item
|
|
A common mistake is to place the grouping parenthesis AFTER an operator, when
|
|
you really meant to place the parenthesis BEFORE the operator, e.g., you
|
|
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.
|
|
|
|
The first pattern matches the words @samp{foo} or @samp{bar} any number of
|
|
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
|
|
second pattern matches a single instance of @code{foo} or a single instance of
|
|
@code{bar} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}.
|
|
@item
|
|
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
|
|
and NOT ``any character except newline''.
|
|
@item
|
|
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
|
|
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
|
|
Beware that the regex @code{(.|\n)+} will match your entire input!
|
|
@item
|
|
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
|
|
@end itemize
|
|
|
|
@node Can I get the flex manual in another format?
|
|
@unnumberedsec Can I get the flex manual in another format?
|
|
|
|
The @code{flex} source distribution includes a texinfo manual. You are
|
|
free to convert that texinfo into whatever format you desire. The
|
|
@code{texinfo} package includes tools for conversion to a number of formats.
|
|
|
|
@node Does there exist a "faster" NDFA->DFA algorithm?
|
|
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?
|
|
|
|
There's no way around the potential exponential running time - it
|
|
can take you exponential time just to enumerate all of the DFA states.
|
|
In practice, though, the running time is closer to linear, or sometimes
|
|
quadratic.
|
|
|
|
@node How does flex compile the DFA so quickly?
|
|
@unnumberedsec How does flex compile the DFA so quickly?
|
|
|
|
There are two big speed wins that @code{flex} uses:
|
|
|
|
@enumerate
|
|
@item
|
|
It analyzes the input rules to construct equivalence classes for those
|
|
characters that always make the same transitions. It then rewrites the NFA
|
|
using equivalence classes for transitions instead of characters. This cuts
|
|
down the NFA->DFA computation time dramatically, to the point where, for
|
|
uncompressed DFA tables, the DFA generation is often I/O bound in writing out
|
|
the tables.
|
|
@item
|
|
It maintains hash values for previously computed DFA states, so testing
|
|
whether a newly constructed DFA state is equivalent to a previously constructed
|
|
state can be done very quickly, by first comparing hash values.
|
|
@end enumerate
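
The second point can be illustrated with the following C sketch. It is not
@code{flex}'s actual code: the structure, the function names and the
simple multiplicative hash are assumptions chosen for the example, and a
linear search stands in for whatever lookup structure a real
implementation uses. The idea shown is only that the expensive set
comparison runs when the cheap hash comparison has already succeeded.

@example
@verbatim
#include <stddef.h>
#include <string.h>

/* A DFA state is identified here by the sorted set of NFA states
 * it stands for. */
struct dfa_state {
    const int *nfa_set;   /* sorted NFA state numbers */
    size_t     count;     /* number of entries in nfa_set */
    unsigned   hash;      /* cached value of hash_set() */
};

/* Cheap summary of a set; equal sets always hash equally. */
unsigned hash_set(const int *set, size_t count)
{
    unsigned h = 0;
    size_t i;

    for (i = 0; i < count; i++)
        h = h * 31u + (unsigned)set[i];
    return h;
}

/* Returns the index of an existing state equal to the candidate
 * set, or -1 if the candidate is new. */
long find_state(const struct dfa_state *states, size_t nstates,
                const int *set, size_t count)
{
    unsigned h = hash_set(set, count);
    size_t i;

    for (i = 0; i < nstates; i++) {
        if (states[i].hash == h && states[i].count == count &&
            memcmp(states[i].nfa_set, set, count * sizeof *set) == 0)
            return (long)i;
    }
    return -1;
}
@end verbatim
@end example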
|
|
|
|
@node How can I use more than 8192 rules?
|
|
@unnumberedsec How can I use more than 8192 rules?
|
|
|
|
@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
|
|
If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
|
|
with the following changes in @file{flexdef.h}:
|
|
|
|
@example
|
|
@verbatim
|
|
< #define YY_TRAILING_MASK 0x2000
|
|
< #define YY_TRAILING_HEAD_MASK 0x4000
|
|
--
|
|
> #define YY_TRAILING_MASK 0x20000000
|
|
> #define YY_TRAILING_HEAD_MASK 0x40000000
|
|
@end verbatim
|
|
@end example
|
|
|
|
This should work okay as long as your C compiler uses 32-bit integers.
|
|
But you might want to think about whether using such a huge number of rules
|
|
is the best way to solve your problem.
|
|
|
|
The following may also be relevant:
|
|
|
|
With luck, you should be able to increase the definitions in @file{flexdef.h} for:
|
|
|
|
@example
|
|
@verbatim
|
|
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
|
|
#define MAXIMUM_MNS 31999
|
|
#define BAD_SUBSCRIPT -32767
|
|
@end verbatim
|
|
@end example
|
|
|
|
recompile everything, and it'll all work. Flex only has these 16-bit-like
|
|
values built into it because a long time ago it was developed on a machine
|
|
with 16-bit ints. I've given this advice to others in the past but haven't
|
|
heard back from them whether it worked okay or not...
|
|
|
|
@node How do I abandon a file in the middle of a scan and switch to a new file?
|
|
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?
|
|
|
|
Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
|
|
``fresh'' start, since @code{yyrestart} does NOT reset the start state back to @code{INITIAL}.
|
|
|
|
@node How do I execute code only during initialization (only before the first scan)?
|
|
@unnumberedsec How do I execute code only during initialization (only before the first scan)?
|
|
|
|
You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
|
|
note that @code{yyout} may not be available at the time this macro is executed). Or you
|
|
can add to the beginning of your rules section:
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
/* Must be indented! */
|
|
static int did_init = 0;
|
|
|
|
if ( ! did_init ){
|
|
do_my_init();
|
|
did_init = 1;
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node How do I execute code at termination?
|
|
@unnumberedsec How do I execute code at termination?
|
|
|
|
You can specify an action for the @code{<<EOF>>} rule.
|
|
|
|
@node Where else can I find help?
|
|
@unnumberedsec Where else can I find help?
|
|
|
|
You can find the flex homepage on the web at
|
|
@uref{http://flex.sourceforge.net/}. See that page for details about flex
|
|
mailing lists as well.
|
|
|
|
@node Can I include comments in the "rules" section of the file?
|
|
@unnumberedsec Can I include comments in the "rules" section of the file?
|
|
|
|
Yes, just about anywhere you want to. See the manual for the specific syntax.
|
|
|
|
@node I get an error about undefined yywrap().
|
|
@unnumberedsec I get an error about undefined yywrap().
|
|
|
|
You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
|
|
(which provides one), or use
|
|
|
|
@example
|
|
@verbatim
|
|
%option noyywrap
|
|
@end verbatim
|
|
@end example
|
|
|
|
in your source to say you don't want a @code{yywrap()} function.
|
|
|
|
@node How can I change the matching pattern at run time?
|
|
@unnumberedsec How can I change the matching pattern at run time?
|
|
|
|
You can't; it's compiled into a static table when @code{flex} builds the scanner.
|
|
|
|
@node How can I expand macros in the input?
|
|
@unnumberedsec How can I expand macros in the input?
|
|
|
|
The best way to approach this problem is at a higher level, e.g., in the parser.
|
|
|
|
However, you can do this using multiple input buffers.
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
macro/[a-z]+ {
|
|
/* Saw the macro "macro" followed by extra stuff. */
|
|
main_buffer = YY_CURRENT_BUFFER;
|
|
expansion_buffer = yy_scan_string(expand(yytext));
|
|
yy_switch_to_buffer(expansion_buffer);
|
|
}
|
|
|
|
<<EOF>> {
|
|
if ( expansion_buffer )
|
|
{
|
|
// We were doing an expansion, return to where
|
|
// we were.
|
|
yy_switch_to_buffer(main_buffer);
|
|
yy_delete_buffer(expansion_buffer);
|
|
expansion_buffer = 0;
|
|
}
|
|
else
|
|
yyterminate();
|
|
}
|
|
@end verbatim
|
|
@end example
|
|
|
|
You probably will want a stack of expansion buffers to allow nested macros.
|
|
Hopefully, though, the idea is clear from the above.
|
|
|
|
@node How can I build a two-pass scanner?
|
|
@unnumberedsec How can I build a two-pass scanner?
|
|
|
|
One way to do it is to filter the first pass to a temporary file,
|
|
then process the temporary file on the second pass. You will probably see a
|
|
performance hit, due to all the disk I/O.
|
|
|
|
When you need to look ahead far forward like this, it almost always means
|
|
that the right solution is to build a parse tree of the entire input, then
|
|
walk it after the parse in order to generate the output. In a sense, this
|
|
is a two-pass approach, once through the text and once through the parse
|
|
tree, but the performance hit for the latter is usually an order of magnitude
|
|
smaller, since everything is already classified, in binary format, and
|
|
residing in memory.
|
|
|
|
@node How do I match any string not matched in the preceding rules?
|
|
@unnumberedsec How do I match any string not matched in the preceding rules?
|
|
|
|
One way to assign precedence is to place the more specific rules first. If
|
|
two rules would match the same input (same sequence of characters) then the
|
|
first rule listed in the @code{flex} input wins, e.g.,
|
|
|
|
@example
|
|
@verbatim
|
|
%%
|
|
foo[a-zA-Z_]+ return FOO_ID;
|
|
bar[a-zA-Z_]+ return BAR_ID;
|
|
[a-zA-Z_]+ return GENERIC_ID;
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the
|
|
same amount of text as the more specific rules, and in that case the
|
|
@code{flex} scanner will pick the first rule listed in your scanner as the
|
|
one to match.
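
The selection policy can be summarized with the following C sketch. It
only models the decision, under the simplifying assumption that the
length matched by each rule is already known and that a length of 0 means
``no match''; the function name @code{pick_rule} is made up for this
example and does not correspond to anything in a generated scanner.

@example
@verbatim
#include <stddef.h>

/* match_len[i] is the length of the text matched by rule i; rules
 * are numbered in the order they appear in the flex input.
 * Returns the winning rule, or -1 if nothing matched. */
long pick_rule(const size_t *match_len, size_t nrules)
{
    long best = -1;
    size_t i;

    for (i = 0; i < nrules; i++) {
        /* Strictly longer wins; on a tie the earlier rule stays. */
        if (match_len[i] > 0 &&
            (best < 0 || match_len[i] > match_len[(size_t)best]))
            best = (long)i;
    }
    return best;
}
@end verbatim
@end example

For the input @samp{foobar} and the three rules above, the first and third
rules both match six characters and the second matches nothing, so the
first rule is chosen because it is listed first.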
|
|
|
|
@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
|
|
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.
|
|
|
|
Those are internal variables pointing into the AT&T scanner's input buffer. I
|
|
imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
|
|
functions. If so, what you need to do is analyze those functions to figure out
|
|
what they're doing, and then replace @code{input()} with an appropriate definition of
|
|
@code{YY_INPUT}. You shouldn't need to (and must not) replace
|
|
@code{flex}'s @code{unput()} function.
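
As a rough sketch of that last step, suppose the old @code{input()} read
its characters from a string held in memory. The helper @code{my_read()}
and the variables @code{my_src} and @code{my_pos} below are hypothetical
names invented for this illustration; only the
@code{YY_INPUT(buf, result, max_size)} macro interface comes from
@code{flex}. Your own replacement will depend on what the original
functions actually did.

@example
@verbatim
#include <string.h>

/* Hypothetical input source: a NUL-terminated string held in memory. */
static const char *my_src = "";
static size_t my_pos = 0;

/* Copy up to max_size bytes into buf; return the count, 0 at end of input. */
static int my_read(char *buf, int max_size)
{
    size_t left = strlen(my_src) - my_pos;
    size_t n = left < (size_t)max_size ? left : (size_t)max_size;

    memcpy(buf, my_src + my_pos, n);
    my_pos += n;
    return (int)n;
}

/* In a real scanner this goes in the definitions section, inside %{ ... %}. */
#define YY_INPUT(buf, result, max_size) \
    ((result) = my_read((buf), (max_size)))
@end verbatim
@end example

Returning 0 from @code{my_read()} is what signals end of input to the scanner.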
|
|
|
|
@node Is there a way to make flex treat NULL like a regular character?
|
|
@unnumberedsec Is there a way to make flex treat NULL like a regular character?
|
|
|
|
Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient
|
|
version of @code{flex}. The latest release is version @value{VERSION}.
|
|
|
|
@node Whenever flex can not match the input it says "flex scanner jammed".
|
|
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".
|
|
|
|
You need to add a rule that matches the otherwise-unmatched text,
|
|
e.g.,
|
|
|
|
@example
|
|
@verbatim
|
|
%option yylineno
|
|
%%
|
|
[[a bunch of rules here]]
|
|
|
|
. printf("bad input character '%s' at line %d\n", yytext, yylineno);
|
|
@end verbatim
|
|
@end example
|
|
|
|
See @code{%option default} for more information.
|
|
|
|
@node Why doesn't flex have non-greedy operators like perl does?
|
|
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?
|
|
|
|
A DFA can do a non-greedy match by stopping
|
|
the first time it enters an accepting state, instead of consuming input until
|
|
it determines that no further matching is possible (a ``jam'' state). This
|
|
is actually easier to implement than longest leftmost match (which @code{flex} does).
|
|
|
|
But it's also much less useful than longest leftmost match. In general,
|
|
when you find yourself wishing for non-greedy matching, that's usually a
|
|
sign that you're trying to make the scanner do some parsing. That's
|
|
generally the wrong approach, since it lacks the power to do a decent job.
|
|
Better is to either introduce a separate parser, or to split the scanner
|
|
into multiple scanners using (exclusive) start conditions.
|
|
|
|
You might have
|
|
a separate start state once you've seen the @samp{BEGIN}. In that state, you
|
|
might then have a regex that will match @samp{END} (to kick you out of the
|
|
state), and perhaps @samp{(.|\n)} to get a single character within the chunk ...
|
|
|
|
This approach also has much better error-reporting properties.
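
To make the difference between the two matching strategies concrete, here
is a small hand-written sketch in C. It simulates a toy DFA for the
pattern @samp{a+} and is not how a @code{flex}-generated scanner is
structured; the names @code{JAM}, @code{dfa_next()}, @code{dfa_accepts()}
and @code{dfa_match()} are all made up for the illustration.

@example
@verbatim
#include <stddef.h>

#define JAM (-1)

/* Toy DFA for the pattern a+ : state 0 is the start state, state 1 accepts. */
static int dfa_next(int state, char c)
{
    (void)state;              /* both states react to input the same way */
    return c == 'a' ? 1 : JAM;
}

static int dfa_accepts(int state)
{
    return state == 1;
}

/*
 * Return the length of the match at the start of s, or 0 if there is none.
 * With greedy != 0, run until the DFA jams and report the last accepting
 * position (longest match).  With greedy == 0, stop the first time an
 * accepting state is entered (non-greedy match).
 */
static size_t dfa_match(const char *s, int greedy)
{
    int state = 0;
    size_t i, last_accept = 0;

    for (i = 0; s[i] != '\0'; i++) {
        state = dfa_next(state, s[i]);
        if (state == JAM)
            break;
        if (dfa_accepts(state)) {
            last_accept = i + 1;
            if (!greedy)
                break;
        }
    }
    return last_accept;
}
@end verbatim
@end example

For the input @samp{aaab}, @code{dfa_match(s, 1)} returns 3 and
@code{dfa_match(s, 0)} returns 1. The non-greedy variant needs no record of
earlier accepting positions, which is why it is the easier of the two to
implement.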
|
|
|
|
@node Memory leak - 16386 bytes allocated by malloc.
|
|
@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
|
|
@anchor{faq-memory-leak}
|
|
|
|
UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
|
|
call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
|
|
on.
|
|
|
|
The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and
|
|
about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
|
|
the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
|
|
scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.
|
|
|
|
However, the leak won't multiply since the buffer is reused no matter how many
|
|
times you call @code{yylex()}.
|
|
|
|
If you want to reclaim the memory when you are completely done scanning, then
|
|
you might try this:
|
|
|
|
@example
|
|
@verbatim
|
|
/* For non-reentrant C scanner only. */
|
|
yy_delete_buffer(YY_CURRENT_BUFFER);
|
|
yy_init = 1;
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note: @code{yy_init} is an "internal variable", and hasn't been tested in this
|
|
situation. It is possible that some other globals may need resetting as well.
|
|
|
|
@node How do I track the byte offset for lseek()?
|
|
@unnumberedsec How do I track the byte offset for lseek()?
|
|
|
|
@example
|
|
@verbatim
|
|
> We thought that it would be possible to have this number through the
|
|
> evaluation of the following expression:
|
|
>
|
|
> seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
|
|
@end verbatim
|
|
@end example
|
|
|
|
While this is the right idea, it has two problems. The first is that
|
|
it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
|
|
an invocation of @code{YY_INPUT} (or that your input source will return less
|
|
even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem
|
|
is that when refilling its internal buffer, @code{flex} keeps some characters
|
|
from the previous buffer (because usually it's in the middle of a match,
|
|
and needs those characters to construct @code{yytext} for the match once it's
|
|
done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
|
|
be exactly the number of characters already read from the current buffer.
|
|
|
|
An alternative solution is to count the number of characters you've matched
|
|
since starting to scan. This can be done by using @code{YY_USER_ACTION}. For
|
|
example,
|
|
|
|
@example
|
|
@verbatim
|
|
#define YY_USER_ACTION num_chars += yyleng;
|
|
@end verbatim
|
|
@end example
|
|
|
|
(You need to be careful to update your bookkeeping if you use @code{yymore()},
|
|
@code{yyless()}, @code{unput()}, or @code{input()}.)
|
|
|
|
@node How do I use my own I/O classes in a C++ scanner?
|
|
@unnumberedsec How do I use my own I/O classes in a C++ scanner?
|
|
|
|
When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.
|
|
|
|
@cindex LexerOutput, overriding
|
|
@cindex LexerInput, overriding
|
|
@cindex overriding LexerOutput
|
|
@cindex overriding LexerInput
|
|
@cindex customizing I/O in C++ scanners
|
|
@cindex C++ I/O, customizing
|
|
You can do this by passing the various functions (such as @code{LexerInput()}
|
|
and @code{LexerOutput()}) NULL @code{iostream*}'s, and then
|
|
dealing with your own I/O classes surreptitiously (i.e., stashing them in
|
|
special member variables). This works because the only assumption the
|
|
lexer makes about what's done with the iostreams is that they're
|
|
ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever
|
|
is necessary with them.
|
|
|
|
@c faq edit stopped here
|
|
@node How do I skip as many chars as possible?
|
|
@unnumberedsec How do I skip as many chars as possible?
|
|
|
|
How do I skip as many chars as possible -- without interfering with the other
|
|
patterns?
|
|
|
|
In the example below, we want to skip over characters until we see the phrase
|
|
"endskip". The following will @emph{NOT} work correctly (do you see why not?)
|
|
|
|
@example
|
|
@verbatim
|
|
/* INCORRECT SCANNER */
|
|
%x SKIP
|
|
%%
|
|
<INITIAL>startskip BEGIN(SKIP);
|
|
...
|
|
<SKIP>"endskip" BEGIN(INITIAL);
|
|
<SKIP>.* ;
|
|
@end verbatim
|
|
@end example
|
|
|
|
The problem is that the pattern @code{.*} will eat up the word "endskip".
|
|
The simplest (but slow) fix is:
|
|
|
|
@example
|
|
@verbatim
|
|
<SKIP>"endskip" BEGIN(INITIAL);
|
|
<SKIP>. ;
|
|
@end verbatim
|
|
@end example
|
|
|
|
A faster fix involves making the second rule match more, without
|
|
making it match "endskip" plus something else. So for example:
|
|
|
|
@example
|
|
@verbatim
|
|
<SKIP>"endskip" BEGIN(INITIAL);
|
|
<SKIP>[^e]+ ;
|
|
<SKIP>. ;/* so you eat up e's, too */
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node deleteme00
|
|
@unnumberedsec deleteme00
|
|
@example
|
|
@verbatim
|
|
QUESTION:
|
|
When was flex born?
|
|
|
|
Vern Paxson took over
|
|
the Software Tools lex project from Jef Poskanzer in 1982. At that point it
|
|
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
|
|
a legend was born :-).
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Are certain equivalent patterns faster than others?
|
|
@unnumberedsec Are certain equivalent patterns faster than others?
|
|
@example
|
|
@verbatim
|
|
To: Adoram Rogel <adoram@orna.hybridge.com>
|
|
Subject: Re: Flex 2.5.2 performance questions
|
|
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
|
|
Date: Wed, 18 Sep 96 10:51:02 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
[Note, the most recent flex release is 2.5.4, which you can get from
|
|
ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.]
|
|
|
|
> 1. Using the pattern
|
|
> ([Ff](oot)?)?[Nn](ote)?(\.)?
|
|
> instead of
|
|
> (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
|
|
> (in a very complicated flex program) caused the program to slow from
|
|
> 300K+/min to 100K/min (no other changes were done).
|
|
|
|
These two are not equivalent. For example, the first can match "footnote."
|
|
but the second can only match "footnote". This is almost certainly the
|
|
cause of the discrepancy - the slower scanner run is matching more tokens,
|
|
and/or having to do more backing up.
|
|
|
|
> 2. Which of these two are better: [Ff]oot or (F|f)oot ?
|
|
|
|
From a performance point of view, they're equivalent (modulo presumably
|
|
minor effects such as memory cache hit rates; and the presence of trailing
|
|
context, see below). From a space point of view, the first is slightly
|
|
preferable.
|
|
|
|
> 3. I have a pattern that look like this:
|
|
> pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd)
|
|
>
|
|
> running yet another complicated program that includes the following rule:
|
|
> <snext>{and}/{no4}{bb}{pats}
|
|
>
|
|
> gets me to "too complicated - over 32,000 states"...
|
|
|
|
I can't tell from this example whether the trailing context is variable-length
|
|
or fixed-length (it could be the latter if {and} is fixed-length). If it's
|
|
variable length, which flex -p will tell you, then this reflects a basic
|
|
performance problem, and if you can eliminate it by restructuring your
|
|
scanner, you will see significant improvement.
|
|
|
|
> so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
|
|
> 10 patterns and changed the rule to be 5 rules.
|
|
> This did compile, but what is the rule of thumb here ?
|
|
|
|
The rule is to avoid trailing context other than fixed-length, in which for
|
|
a/b, either the 'a' pattern or the 'b' pattern has a fixed length. Use
|
|
of the '|' operator automatically makes the pattern variable length, so in
|
|
this case '[Ff]oot' is preferred to '(F|f)oot'.
|
|
|
|
> 4. I changed a rule that looked like this:
|
|
> <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
|
|
>
|
|
> to the next 2 rules:
|
|
> <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
|
|
> <snext8>{and}{bb}/{ROMAN} { BEGIN...
|
|
>
|
|
> Again, I understand the using [^...] will cause a great performance loss
|
|
|
|
Actually, it doesn't cause any sort of performance loss. It's a surprising
|
|
fact about regular expressions that they always match in linear time
|
|
regardless of how complex they are.
|
|
|
|
> but are there any specific rules about it ?
|
|
|
|
See the "Performance Considerations" section of the man page, and also
|
|
the example in MISC/fastwc/.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Is backing up a big deal?
|
|
@unnumberedsec Is backing up a big deal?
|
|
@example
|
|
@verbatim
|
|
To: Adoram Rogel <adoram@hybridge.com>
|
|
Subject: Re: Flex 2.5.2 performance questions
|
|
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
|
|
Date: Thu, 19 Sep 96 09:58:00 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> a lot about the backing up problem.
|
|
> I believe that there lies my biggest problem, and I'll try to improve
|
|
> it.
|
|
|
|
Since you have variable trailing context, this is a bigger performance
|
|
problem. Fixing it is usually easier than fixing backing up, which in a
|
|
complicated scanner (yours seems to fit the bill) can be extremely
|
|
difficult to do correctly.
|
|
|
|
You also don't mention what flags you are using for your scanner.
|
|
-f makes a large speed difference, and -Cfe buys you nearly as much
|
|
speed but the resulting scanner is considerably smaller.
|
|
|
|
> I have an | operator in {and} and in {pats} so both of them are variable
|
|
> length.
|
|
|
|
-p should have reported this.
|
|
|
|
> Is changing one of them to fixed-length is enough ?
|
|
|
|
Yes.
|
|
|
|
> Is it possible to change the 32,000 states limit ?
|
|
|
|
Yes. I've appended instructions on how. Before you make this change,
|
|
though, you should think about whether there are ways to fundamentally
|
|
simplify your scanner - those are certainly preferable!
|
|
|
|
Vern
|
|
|
|
To increase the 32K limit (on a machine with 32 bit integers), you increase
|
|
the magnitude of the following in flexdef.h:
|
|
|
|
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
|
|
#define MAXIMUM_MNS 31999
|
|
#define BAD_SUBSCRIPT -32767
|
|
#define MAX_SHORT 32700
|
|
|
|
Adding a 0 or two after each should do the trick.
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Can I fake multi-byte character support?
|
|
@unnumberedsec Can I fake multi-byte character support?
|
|
@example
|
|
@verbatim
|
|
To: Heeman_Lee@hp.com
|
|
Subject: Re: flex - multi-byte support?
|
|
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
|
|
Date: Fri, 04 Oct 1996 11:42:18 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I assume as long as my *.l file defines the
|
|
> range of expected character code values (in octal format), flex will
|
|
> scan the file and read multi-byte characters correctly. But I have no
|
|
> confidence in this assumption.
|
|
|
|
Your lack of confidence is justified - this won't work.
|
|
|
|
Flex has in it a widespread assumption that the input is processed
|
|
one byte at a time. Fixing this is on the to-do list, but is involved,
|
|
so it won't happen any time soon. In the interim, the best I can suggest
|
|
(unless you want to try fixing it yourself) is to write your rules in
|
|
terms of pairs of bytes, using definitions in the first section:
|
|
|
|
X \xfe\xc2
|
|
...
|
|
%%
|
|
foo{X}bar found_foo_fe_c2_bar();
|
|
|
|
etc. Definitely a pain - sorry about that.
|
|
|
|
By the way, the email address you used for me is ancient, indicating you
|
|
have a very old version of flex. You can get the most recent, 2.5.4, from
|
|
ftp.ee.lbl.gov.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node deleteme01
|
|
@unnumberedsec deleteme01
|
|
@example
|
|
@verbatim
|
|
To: moleary@primus.com
|
|
Subject: Re: Flex / Unicode compatibility question
|
|
In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
|
|
Date: Tue, 22 Oct 1996 11:06:13 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
Unfortunately flex at the moment has a widespread assumption within it
|
|
that characters are processed 8 bits at a time. I don't see any easy
|
|
fix for this (other than writing your rules in terms of double characters -
|
|
a pain). I also don't know of a wider lex, though you might try surfing
|
|
the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
|
|
toolkit (try searching say Alta Vista for "Purdue Compiler Construction
|
|
Toolkit").
|
|
|
|
Fixing flex to handle wider characters is on the long-term to-do list.
|
|
But since flex is a strictly spare-time project these days, this probably
|
|
won't happen for quite a while, unless someone else does it first.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Can you discuss some flex internals?
|
|
@unnumberedsec Can you discuss some flex internals?
|
|
@example
|
|
@verbatim
|
|
To: Johan Linde <jl@theophys.kth.se>
|
|
Subject: Re: translation of flex
|
|
In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
|
|
Date: Mon, 11 Nov 1996 10:33:50 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I'm working for the Swedish team translating GNU program, and I'm currently
|
|
> working with flex. I have a few questions about some of the messages which
|
|
> I hope you can answer.
|
|
|
|
All of the things you're wondering about, by the way, concerning flex
|
|
internals - probably the only person who understands what they mean in
|
|
English is me! So I wouldn't worry too much about getting them right.
|
|
That said ...
|
|
|
|
> #: main.c:545
|
|
> msgid " %d protos created\n"
|
|
>
|
|
> Does proto mean prototype?
|
|
|
|
Yes - prototypes of state compression tables.
|
|
|
|
> #: main.c:539
|
|
> msgid " %d/%d (peak %d) template nxt-chk entries created\n"
|
|
>
|
|
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
|
|
> However, 'template next-check entries' doesn't make much sense to me. To be
|
|
> able to find a good translation I need to know a little bit more about it.
|
|
|
|
There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
|
|
scanner tables. It involves creating two pairs of tables. The first has
|
|
"base" and "default" entries, the second has "next" and "check" entries.
|
|
The "base" entry is indexed by the current state and yields an index into
|
|
the next/check table. The "default" entry gives what to do if the state
|
|
transition isn't found in next/check. The "next" entry gives the next
|
|
state to enter, but only if the "check" entry verifies that this entry is
|
|
correct for the current state. Flex creates templates of series of
|
|
next/check entries and then encodes differences from these templates as a
|
|
way to compress the tables.
|
|
|
|
> #: main.c:533
|
|
> msgid " %d/%d base-def entries created\n"
|
|
>
|
|
> The same problem here for 'base-def'.
|
|
|
|
See above.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
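
The base/default/next/check lookup described in the message above can be
sketched in C as follows. This is a toy illustration with hand-built
tables for a three-state automaton over two symbols. The names
@code{base_tbl}, @code{default_tbl}, @code{next_tbl}, @code{check_tbl} and
@code{next_state()} are invented for the sketch, they are not the arrays
that @code{flex} generates, and the template encoding mentioned in the
message is not modeled.

@example
@verbatim
#define JAM       (-1)
#define NO_STATE  (-1)

/*
 * State 0: symbol 0 -> state 1, symbol 1 -> state 0 (both stored directly).
 * State 1: symbol 1 -> state 2 is stored; symbol 0 is inherited from
 *          state 0 through the default table.
 * State 2: no transitions at all, so every symbol jams.
 */
static const int base_tbl[]    = { 0, 1, 2 };
static const int default_tbl[] = { NO_STATE, 0, NO_STATE };
static const int next_tbl[]    = { 1, 0, 2, 0 };
static const int check_tbl[]   = { 0, 0, 1, NO_STATE };

/* Look up the transition for (state, sym), following defaults as needed. */
static int next_state(int state, int sym)
{
    while (state != NO_STATE) {
        int slot = base_tbl[state] + sym;

        if (check_tbl[slot] == state)   /* the slot really belongs to us */
            return next_tbl[slot];
        state = default_tbl[state];     /* otherwise try the default state */
    }
    return JAM;
}
@end verbatim
@end example

The saving comes from state 1 storing only the one transition in which it
differs from state 0. The ``check'' entry is what prevents state 1 from
misreading a slot that happens to belong to state 0.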
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unput() messes up yy_at_bol
|
|
@unnumberedsec unput() messes up yy_at_bol
|
|
@example
|
|
@verbatim
|
|
To: Xinying Li <xli@npac.syr.edu>
|
|
Subject: Re: FLEX ?
|
|
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
|
|
Date: Wed, 13 Nov 1996 19:51:54 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> "unput()" them to input flow, question occurs. If I do this after I scan
|
|
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
|
|
> means the carriage flag has gone.
|
|
|
|
You can control this by calling yy_set_bol(). It's described in the manual.
|
|
|
|
> And if in pre-reading it goes to the end of file, is anything done
|
|
> to control the end of curren buffer and end of file?
|
|
|
|
No, there's no way to put back an end-of-file.
|
|
|
|
> By the way I am using flex 2.5.2 and using the "-l".
|
|
|
|
The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and
|
|
2.5.3. You can get it from ftp.ee.lbl.gov.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node The | operator is not doing what I want
|
|
@unnumberedsec The | operator is not doing what I want
|
|
@example
|
|
@verbatim
|
|
To: Alain.ISSARD@st.com
|
|
Subject: Re: Start condition with FLEX
|
|
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
|
|
Date: Mon, 18 Nov 1996 10:41:34 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I am not able to use the start condition scope and to use the | (OR) with
|
|
> rules having start conditions.
|
|
|
|
The problem is that if you use '|' as a regular expression operator, for
|
|
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
|
|
any blanks around it. If you instead want the special '|' *action* (which
|
|
from your scanner appears to be the case), which is a way of giving two
|
|
different rules the same action:
|
|
|
|
foo |
|
|
bar matched_foo_or_bar();
|
|
|
|
then '|' *must* be separated from the first rule by whitespace and *must*
|
|
be followed by a new line. You *cannot* write it as:
|
|
|
|
foo | bar matched_foo_or_bar();
|
|
|
|
even though you might think you could because yacc supports this syntax.
|
|
The reason for this unfortunate incompatibility is historical, but it's
|
|
unlikely to be changed.
|
|
|
|
Your problems with start condition scope are simply due to syntax errors
|
|
from your use of '|' later confusing flex.
|
|
|
|
Let me know if you still have problems.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Why can't flex understand this variable trailing context pattern?
|
|
@unnumberedsec Why can't flex understand this variable trailing context pattern?
|
|
@example
|
|
@verbatim
|
|
To: Gregory Margo <gmargo@newton.vip.best.com>
|
|
Subject: Re: flex-2.5.3 bug report
|
|
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
|
|
Date: Sat, 23 Nov 1996 17:07:32 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> Enclosed is a lex file that "real" lex will process, but I cannot get
|
|
> flex to process it. Could you try it and maybe point me in the right direction?
|
|
|
|
Your problem is that some of the definitions in the scanner use the '/'
|
|
trailing context operator, and have it enclosed in ()'s. Flex does not
|
|
allow this operator to be enclosed in ()'s because doing so allows undefined
|
|
regular expressions such as "(a/b)+". So the solution is to remove the
|
|
parentheses. Note that you must also be building the scanner with the -l
|
|
option for AT&T lex compatibility. Without this option, flex automatically
|
|
encloses the definitions in parentheses.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node The ^ operator isn't working
|
|
@unnumberedsec The ^ operator isn't working
|
|
@example
|
|
@verbatim
|
|
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
|
|
Subject: Re: Flex Bug ?
|
|
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
|
|
Date: Tue, 26 Nov 1996 11:15:05 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> In my lexer code, i have the line :
|
|
> ^\*.* { }
|
|
>
|
|
> Thus all lines starting with an astrix (*) are comment lines.
|
|
> This does not work !
|
|
|
|
I can't get this problem to reproduce - it works fine for me. Note
|
|
though that if what you have is slightly different:
|
|
|
|
COMMENT ^\*.*
|
|
%%
|
|
{COMMENT} { }
|
|
|
|
then it won't work, because flex pushes back macro definitions enclosed
|
|
in ()'s, so the rule becomes
|
|
|
|
(^\*.*) { }
|
|
|
|
and now that the '^' operator is not at the immediate beginning of the
|
|
line, it's interpreted as just a regular character. You can avoid this
|
|
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Trailing context is getting confused with trailing optional patterns
|
|
@unnumberedsec Trailing context is getting confused with trailing optional patterns
|
|
@example
|
|
@verbatim
|
|
To: Adoram Rogel <adoram@hybridge.com>
|
|
Subject: Re: Flex 2.5.4 BOF ???
|
|
In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
|
|
Date: Wed, 27 Nov 1996 10:56:25 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> Organization(s)?/[a-z]
|
|
>
|
|
> This matched "Organizations" (looking in debug mode, the trailing s
|
|
> was matched with trailing context instead of the optional (s) in the
|
|
> end of the word.
|
|
|
|
That should only happen with lex. Flex can properly match this pattern.
|
|
(That might be what you're saying, I'm just not sure.)
|
|
|
|
> Is there a way to avoid this dangerous trailing context problem ?
|
|
|
|
Unfortunately, there's no easy way. On the other hand, I don't see why
|
|
it should be a problem. Lex's matching is clearly wrong, and I'd hope
|
|
that usually the intent remains the same as expressed with the pattern,
|
|
so flex's matching will be correct.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Is flex GNU or not?
|
|
@unnumberedsec Is flex GNU or not?
|
|
@example
|
|
@verbatim
|
|
To: Cameron MacKinnon <mackin@interlog.com>
|
|
Subject: Re: Flex documentation bug
|
|
In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
|
|
Date: Sun, 01 Dec 1996 22:29:39 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I'm not sure how or where to submit bug reports (documentation or
|
|
> otherwise) for the GNU project stuff ...
|
|
|
|
Well, strictly speaking flex isn't part of the GNU project. They just
|
|
distribute it because no one's written a decent GPL'd lex replacement.
|
|
So you should send bugs directly to me. Those sent to the GNU folks
|
|
sometimes find their way to me, but some may drop between the cracks.
|
|
|
|
> In GNU Info, under the section 'Start Conditions', and also in the man
|
|
> page (mine's dated April '95) is a nice little snippet showing how to
|
|
> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
|
|
> size. Unfortunately, no overflow checking is ever done ...
|
|
|
|
This is already mentioned in the manual:
|
|
|
|
Finally, here's an example of how to match C-style quoted
|
|
strings using exclusive start conditions, including expanded
|
|
escape sequences (but not including checking for a string
|
|
that's too long):
|
|
|
|
The reason for not doing the overflow checking is that it will needlessly
|
|
clutter up an example whose main purpose is just to demonstrate how to
|
|
use flex.
|
|
|
|
The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
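
For readers who do want the overflow check, a minimal sketch of a bounded
append might look like the following. Only the name @code{MAX_STR_CONST}
is taken from the manual's example, and its value here is arbitrary;
@code{struct str_buf} and @code{str_buf_put()} are hypothetical helpers
invented for this illustration and are not part of @code{flex}.

@example
@verbatim
#include <stddef.h>

#define MAX_STR_CONST 8   /* deliberately tiny for the sketch */

struct str_buf {
    char data[MAX_STR_CONST];
    size_t len;           /* characters stored, not counting the NUL */
};

/* Append c and keep the buffer NUL-terminated.
 * Returns 1 on success, 0 if the character would not fit. */
static int str_buf_put(struct str_buf *b, char c)
{
    if (b->len + 1 >= MAX_STR_CONST)
        return 0;
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
    return 1;
}
@end verbatim
@end example

A scanner action would call such a helper instead of writing through a raw
pointer, and report an error when it returns 0.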
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node ERASEME53
|
|
@unnumberedsec ERASEME53
|
|
@example
|
|
@verbatim
|
|
To: tsv@cs.UManitoba.CA
|
|
Subject: Re: Flex (reg)..
|
|
In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
|
|
Date: Thu, 06 Mar 1997 15:54:19 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> [:alpha:] ([:alnum:] | \\_)*
|
|
|
|
If your rule really has embedded blanks as shown above, then it won't
|
|
work, as the first blank delimits the rule from the action. (It wouldn't
|
|
even compile ...) You need instead:
|
|
|
|
[:alpha:]([:alnum:]|\\_)*
|
|
|
|
and that should work fine - there's no restriction on what can go inside
|
|
of ()'s except for the trailing context operator, '/'.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node I need to scan if-then-else blocks and while loops
|
|
@unnumberedsec I need to scan if-then-else blocks and while loops
|
|
@example
|
|
@verbatim
|
|
To: "Mike Stolnicki" <mstolnic@ford.com>
|
|
Subject: Re: FLEX help
|
|
In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
|
|
Date: Fri, 30 May 1997 10:46:35 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> We'd like to add "if-then-else", "while", and "for" statements to our
|
|
> language ...
|
|
> We've investigated many possible solutions. The one solution that seems
|
|
> the most reasonable involves knowing the position of a TOKEN in yyin.
|
|
|
|
I strongly advise you to instead build a parse tree (abstract syntax tree)
|
|
and loop over that instead. You'll find this has major benefits in keeping
|
|
your interpreter simple and extensible.
|
|
|
|
That said, the functionality you mention for get_position and set_position
|
|
have been on the to-do list for a while. As flex is a purely spare-time
|
|
project for me, no guarantees when this will be added (in particular, it
|
|
for sure won't be for many months to come).
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
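
The advice above, to loop over a tree rather than over the input text, can
be illustrated with a very small interpreter. This is only a sketch under
simplifying assumptions (integer values, a fixed array of variables, no
error handling), and every name in it (@code{struct node}, @code{eval()},
@code{vars} and the @code{K_} constants) is made up for the example.

@example
@verbatim
enum kind { K_NUM, K_VAR, K_ADD, K_ASSIGN, K_WHILE, K_SEQ };

struct node {
    enum kind kind;
    int value;                 /* K_NUM: literal; K_VAR, K_ASSIGN: slot */
    const struct node *a, *b;  /* operands, condition/body, or first/second */
};

static int vars[4];            /* hypothetical variable store */

static int eval(const struct node *n)
{
    switch (n->kind) {
    case K_NUM:    return n->value;
    case K_VAR:    return vars[n->value];
    case K_ADD:    return eval(n->a) + eval(n->b);
    case K_ASSIGN: return vars[n->value] = eval(n->a);
    case K_WHILE:
        /* The loop re-walks the tree; the input is never re-scanned. */
        while (eval(n->a) != 0)
            eval(n->b);
        return 0;
    case K_SEQ:
        eval(n->a);
        return eval(n->b);
    }
    return 0;
}
@end verbatim
@end example

Because the loop body is an ordinary subtree, executing it again needs no
knowledge of where its tokens were located in @code{yyin}.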
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node ERASEME55
|
|
@unnumberedsec ERASEME55
|
|
@example
|
|
@verbatim
|
|
To: Colin Paul Adams <colin@colina.demon.co.uk>
|
|
Subject: Re: Flex C++ classes and Bison
|
|
In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
|
|
Date: Fri, 15 Aug 1997 10:48:19 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> #define YY_DECL int yylex (YYSTYPE *lvalp, struct parser_control
|
|
> *parm)
|
|
>
|
|
> I have been trying to get this to work as a C++ scanner, but it does
|
|
> not appear to be possible (warning that it matches no declarations in
|
|
> yyFlexLexer, or something like that).
|
|
>
|
|
> Is this supposed to be possible, or is it being worked on (I DID
|
|
> notice the comment that scanner classes are still experimental, so I'm
|
|
> not too hopeful)?
|
|
|
|
What you need to do is derive a subclass from yyFlexLexer that provides
|
|
the above yylex() method, squirrels away lvalp and parm into member
|
|
variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node ERASEME56
|
|
@unnumberedsec ERASEME56
|
|
@example
|
|
@verbatim
|
|
To: Mikael.Latvala@lmf.ericsson.se
|
|
Subject: Re: Possible mistake in Flex v2.5 document
|
|
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
|
|
Date: Fri, 05 Sep 1997 10:01:54 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> In that example you show how to count comment lines when using
|
|
> C style /* ... */ comments. My question is, shouldn't you take into
|
|
> account a scenario where end of a comment marker occurs inside
|
|
> character or string literals?
|
|
|
|
The scanner certainly needs to also scan character and string literals.
|
|
However it does that (there's an example in the man page for strings), the
|
|
lexer will recognize the beginning of the literal before it runs across the
|
|
embedded "/*". Consequently, it will finish scanning the literal before it
|
|
even considers the possibility of matching "/*".
|
|
|
|
Example:
|
|
|
|
'([^']*|{ESCAPE_SEQUENCE})'
|
|
|
|
will match all the text between the ''s (inclusive). So the lexer
|
|
considers this as a token beginning at the first ', and doesn't even
|
|
attempt to match other tokens inside it.
|
|
|
|
I think this subtlety is not worth putting in the manual, as I suspect
|
|
it would confuse more people than it would enlighten.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node ERASEME57
|
|
@unnumberedsec ERASEME57
|
|
@example
|
|
@verbatim
|
|
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
|
|
Subject: Re: flex limitations
|
|
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
|
|
Date: Mon, 08 Sep 1997 11:38:08 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> %%
|
|
> [a-zA-Z]+ /* skip a line */
|
|
> { printf("got %s\n", yytext); }
|
|
> %%
|
|
|
|
What version of flex are you using? If I feed this to 2.5.4, it complains:
|
|
|
|
"bug.l", line 5: EOF encountered inside an action
|
|
"bug.l", line 5: unrecognized rule
|
|
"bug.l", line 5: fatal parse error
|
|
|
|
Not the world's greatest error message, but it manages to flag the problem.
|
|
|
|
(With the introduction of start condition scopes, flex can't accommodate
|
|
an action on a separate line, since it's ambiguous with an indented rule.)
|
|
|
|
You can get 2.5.4 from ftp.ee.lbl.gov.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Is there a repository for flex scanners?
|
|
@unnumberedsec Is there a repository for flex scanners?
|
|
|
|
Not that we know of. You might try asking on comp.compilers.
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node How can I conditionally compile or preprocess my flex input file?
|
|
@unnumberedsec How can I conditionally compile or preprocess my flex input file?
|
|
|
|
|
|
Flex doesn't have a preprocessor like C does. You might try using m4, or the C
|
|
preprocessor plus a sed script to clean up the result.
|
|
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node Where can I find grammars for lex and yacc?
|
|
@unnumberedsec Where can I find grammars for lex and yacc?
|
|
|
|
In the sources for flex and bison.
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node I get an end-of-buffer message for each character scanned.
|
|
@unnumberedsec I get an end-of-buffer message for each character scanned.
|
|
|
|
This will happen if your LexerInput() function returns only one character
|
|
at a time, which can happen either if your scanner is "interactive", or
|
|
if the streams library on your platform always returns 1 for yyin->gcount().
|
|
|
|
Solution: override LexerInput() with a version that returns whole buffers.
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-62
|
|
@unnumberedsec unnamed-faq-62
|
|
@example
|
|
@verbatim
|
|
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
|
|
Subject: Re: Flex maximums
|
|
In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
|
|
Date: Mon, 17 Nov 1997 17:16:15 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I took a quick look into the flex-sources and altered some #defines in
|
|
> flexdefs.h:
|
|
>
|
|
> #define INITIAL_MNS 64000
|
|
> #define MNS_INCREMENT 1024000
|
|
> #define MAXIMUM_MNS 64000
|
|
|
|
The things to fix are to add a couple of zeroes to:
|
|
|
|
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
|
|
#define MAXIMUM_MNS 31999
|
|
#define BAD_SUBSCRIPT -32767
|
|
#define MAX_SHORT 32700
|
|
|
|
and, if you get complaints about too many rules, make the following change too:
|
|
|
|
#define YY_TRAILING_MASK 0x200000
|
|
#define YY_TRAILING_HEAD_MASK 0x400000
|
|
|
|
- Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-63
|
|
@unnumberedsec unnamed-faq-63
|
|
@example
|
|
@verbatim
|
|
To: jimmey@lexis-nexis.com (Jimmey Todd)
|
|
Subject: Re: FLEX question regarding istream vs ifstream
|
|
In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
|
|
Date: Mon, 15 Dec 1997 13:21:35 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> stdin_handle = YY_CURRENT_BUFFER;
|
|
> ifstream fin( "aFile" );
|
|
> yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
|
|
>
|
|
> What I'm wanting to do, is pass the contents of a file thru one set
|
|
> of rules and then pass stdin thru another set... It works great if, I
|
|
> don't use the C++ classes. But since everything else that I'm doing is
|
|
> in C++, I thought I'd be consistent.
|
|
>
|
|
> The problem is that 'yy_create_buffer' is expecting an istream* as it's
|
|
> first argument (as stated in the man page). However, fin is a ifstream
|
|
> object. Any ideas on what I might be doing wrong? Any help would be
|
|
> appreciated. Thanks!!
|
|
|
|
You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
|
|
Then its type will be compatible with the expected istream*, because ifstream
|
|
is derived from istream.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-64
|
|
@unnumberedsec unnamed-faq-64
|
|
@example
|
|
@verbatim
|
|
To: Enda Fadian <fadiane@piercom.ie>
|
|
Subject: Re: Question related to Flex man page?
|
|
In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
|
|
Date: Tue, 16 Dec 1997 14:17:09 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> Can you explain to me what is ment by a long-jump in relation to flex?
|
|
|
|
Using the longjmp() function while inside yylex() or a routine called by it.
|
|
|
|
> what is the flex activation frame.
|
|
|
|
Just yylex()'s stack frame.
|
|
|
|
> As far as I can see yyrestart will bring me back to the sart of the input
|
|
> file and using flex++ isnot really an option!
|
|
|
|
No, yyrestart() doesn't imply a rewind, even though its name might sound
|
|
like it does. It tells the scanner to flush its internal buffers and
|
|
start reading from the given file at its present location.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-65
|
|
@unnumberedsec unnamed-faq-65
|
|
@example
|
|
@verbatim
|
|
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
|
|
Subject: Re: Need urgent Help
|
|
In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
|
|
Date: Sun, 21 Dec 1997 21:30:46 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> /usr/lib/yaccpar: In function `int yyparse()':
|
|
> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
|
|
>
|
|
> ld: Undefined symbol
|
|
> _yylex
|
|
> _yyparse
|
|
> _yyin
|
|
|
|
This is a known problem with Solaris C++ (and/or Solaris yacc). I believe
|
|
the fix is to explicitly insert some 'extern "C"' statements for the
|
|
corresponding routines/symbols.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-66
|
|
@unnumberedsec unnamed-faq-66
|
|
@example
|
|
@verbatim
|
|
To: mc0307@mclink.it
|
|
Cc: gnu@prep.ai.mit.edu
|
|
Subject: Re: [mc0307@mclink.it: Help request]
|
|
In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
|
|
Date: Sun, 21 Dec 1997 22:33:37 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> This is my definition for float and integer types:
|
|
> . . .
|
|
> NZD [1-9]
|
|
> ...
|
|
> I've tested my program on other lex version (on UNIX Sun Solaris an HP
|
|
> UNIX) and it work well, so I think that my definitions are correct.
|
|
> There are any differences between Lex and Flex?
|
|
|
|
There are indeed differences, as discussed in the man page. The one
|
|
you are probably running into is that when flex expands a name definition,
|
|
it puts parentheses around the expansion, while lex does not. There's
|
|
an example in the man page of how this can lead to different matching.
|
|
Flex's behavior complies with the POSIX standard (or at least with the
|
|
last POSIX draft I saw).
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
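
The effect of the added parentheses can be sketched outside of flex with
the POSIX @code{regcomp()} interface. This is only an illustration, not
how either tool is implemented: the definition @code{ALT a|b}, the rule
@code{x@{ALT@}y} and the helper @code{matches()} are all assumptions made
up for this example. The two pattern strings spell out what the rule
becomes when the expansion is wrapped in parentheses (as flex does) and
when it is substituted as plain text (as the reply describes for lex).

@example
@verbatim
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the POSIX extended regex
 * pattern, 0 otherwise (including when the pattern fails to compile). */
int matches(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int ok;

    /* Anchor the pattern so that only a full-string match counts. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}

/* Hypothetical definition:  ALT  a|b      Hypothetical rule:  x{ALT}y */
const char flex_style[] = "x(a|b)y";  /* expansion wrapped in parentheses */
const char lex_style[]  = "xa|by";    /* plain textual substitution */
@end verbatim
@end example

With these definitions, @code{matches(flex_style, "xay")} succeeds while
@code{matches(lex_style, "xay")} fails; the textual form instead accepts
@code{xa} or @code{by}.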
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-67
|
|
@unnumberedsec unnamed-faq-67
|
|
@example
|
|
@verbatim
|
|
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
|
|
Subject: Re: Thanks
|
|
In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
|
|
Date: Mon, 22 Dec 1997 14:35:05 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> Thank you very much for your help. I compile and link well with C++ while
|
|
> declaring 'yylex ...' extern, But a little problem remains. I get a
|
|
> segmentation default when executing ( I linked with lfl library) while it
|
|
> works well when using LEX instead of flex. Do you have some ideas about the
|
|
> reason for this ?
|
|
|
|
The one possible reason for this that comes to mind is if you've defined
|
|
yytext as "extern char yytext[]" (which is what lex uses) instead of
|
|
"extern char *yytext" (which is what flex uses). If it's not that, then
|
|
I'm afraid I don't know what the problem might be.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-68
|
|
@unnumberedsec unnamed-faq-68
|
|
@example
|
|
@verbatim
|
|
To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
|
|
Subject: Re: flex 2.5: c++ scanners & start conditions
|
|
In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
|
|
Date: Tue, 06 Jan 1998 19:19:30 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> The problem is that when I do this (using %option c++) start
|
|
> conditions seem to not apply.
|
|
|
|
The BEGIN macro modifies the yy_start variable. For C scanners, this
|
|
is a static with scope visible through the whole file. For C++ scanners,
|
|
it's a member variable, so it only has visible scope within a member
|
|
function. Your lexbegin() routine is not a member function when you
|
|
build a C++ scanner, so it's not modifying the correct yy_start. The
|
|
diagnostic that indicates this is that you found you needed to add
|
|
a declaration of yy_start in order to get your scanner to compile when
|
|
using C++; instead, the correct fix is to make lexbegin() a member
|
|
function (by deriving from yyFlexLexer).
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-69
|
|
@unnumberedsec unnamed-faq-69
|
|
@example
|
|
@verbatim
|
|
To: "Boris Zinin" <boris@ippe.rssi.ru>
|
|
Subject: Re: current position in flex buffer
|
|
In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
|
|
Date: Mon, 12 Jan 1998 12:03:15 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> The problem is how to determine the current position in flex active
|
|
> buffer when a rule is matched....
|
|
|
|
You will need to keep track of this explicitly, such as by redefining
|
|
YY_USER_ACTION to count the number of characters matched.
|
|
|
|
The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-70
|
|
@unnumberedsec unnamed-faq-70
|
|
@example
|
|
@verbatim
|
|
To: Bik.Dhaliwal@bis.org
|
|
Subject: Re: Flex question
|
|
In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
|
|
Date: Tue, 27 Jan 1998 22:41:52 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> That requirement involves knowing
|
|
> the character position at which a particular token was matched
|
|
> in the lexer.
|
|
|
|
The way you have to do this is by explicitly keeping track of where
|
|
you are in the file, by counting the number of characters scanned
|
|
for each token (available in yyleng). It may prove convenient to
|
|
do this by redefining YY_USER_ACTION, as described in the manual.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
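
The bookkeeping described above can be sketched in plain C. In a real
scanner, flex supplies @code{yyleng} and runs @code{YY_USER_ACTION} just
before each matched rule's action; here @code{fake_match()} is a
hypothetical stand-in for the generated scanner, and the counters
@code{tok_start} and @code{tok_end} are names invented for this sketch.

@example
@verbatim
#include <string.h>

long tok_start = 0;   /* offset of the current token's first character */
long tok_end = 0;     /* offset just past the current token */
int yyleng = 0;       /* stand-in for the yyleng that flex maintains */

/* In a .l file this definition would go in the %{ ... %} section. */
#define YY_USER_ACTION  tok_start = tok_end; tok_end += yyleng;

/* Hypothetical stand-in for the generated scanner matching one token. */
void fake_match(const char *lexeme)
{
    yyleng = (int) strlen(lexeme);
    YY_USER_ACTION
    /* ... the rule's own action would run here ... */
}
@end verbatim
@end example

After matching @code{int}, a single space and @code{count} in that order,
@code{tok_start} is 4 and @code{tok_end} is 9. Actions that use
@code{yyless()}, @code{yymore()}, @code{unput()} or @code{input()} change
how much text was really consumed, so a real scanner would probably need
to adjust the counters there as well.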
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-71
|
|
@unnumberedsec unnamed-faq-71
|
|
@example
|
|
@verbatim
|
|
To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
|
|
Subject: Re: flex: how to control start condition from parser?
|
|
In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
|
|
Date: Tue, 27 Jan 1998 22:45:37 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> It seems useful for the parser to be able to tell the lexer about such
|
|
> context dependencies, because then they don't have to be limited to
|
|
> local or sequential context.
|
|
|
|
One way to do this is to have the parser call a stub routine that's
|
|
included in the scanner's .l file, and consequently that has access to
|
|
BEGIN. The only ugliness is that the parser can't pass in the state
|
|
it wants, because those aren't visible - but if you don't have many
|
|
such states, then using a different set of names doesn't seem like
|
|
too much of a burden.
|
|
|
|
While generating a .h file like you suggest is certainly cleaner,
|
|
flex development has come to a virtual stand-still :-(, so a workaround
|
|
like the above is much more pragmatic than waiting for a new feature.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
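
A minimal sketch of such a stub follows. It is written as plain C so that
it stands alone: the @code{yy_start} variable, the @code{BEGIN} macro and
the @code{INITIAL} and @code{EXPR} definitions stand in for what the
generated scanner would provide, and @code{lexer_set_mode()} and the
@code{LEXER_MODE_*} names are hypothetical. In practice only the enum
(placed in a header shared with the parser) and the function (placed in
the user code section of the .l file) would be written by hand.

@example
@verbatim
/* --- stand-ins for what flex generates from "%x EXPR" --- */
int yy_start = 0;
#define BEGIN   yy_start = 1 + 2 *
#define INITIAL 0
#define EXPR    1

/* --- would live in a header shared with the parser --- */
enum lexer_mode { LEXER_MODE_DEFAULT, LEXER_MODE_EXPR };

/* --- would live in the last section of the .l file --- */
void lexer_set_mode(enum lexer_mode mode)
{
    switch (mode) {
    case LEXER_MODE_EXPR:
        BEGIN(EXPR);
        break;
    default:
        BEGIN(INITIAL);
        break;
    }
}
@end verbatim
@end example

A parser action could then call @code{lexer_set_mode(LEXER_MODE_EXPR)}
without knowing any start condition names. Note that the parser may
already have read a lookahead token by the time the action runs, so the
switch takes effect only for tokens scanned afterwards.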
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-72
|
|
@unnumberedsec unnamed-faq-72
|
|
@example
|
|
@verbatim
|
|
To: Barbara Denny <denny@3com.com>
|
|
Subject: Re: freebsd flex bug?
|
|
In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
|
|
Date: Fri, 30 Jan 1998 12:42:32 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> lex.yy.c:1996: parse error before `='
|
|
|
|
This is the key, identifying this error. (It may help to pinpoint
|
|
it by using flex -L, so it doesn't generate #line directives in its
|
|
output.) I will bet you heavy money that you have a start condition
|
|
name that is also a variable name, or something like that; flex spits
|
|
out #define's for each start condition name, mapping them to a number,
|
|
so you can wind up with:
|
|
|
|
%x foo
|
|
%%
|
|
...
|
|
%%
|
|
void bar()
|
|
{
|
|
int foo = 3;
|
|
}
|
|
|
|
and the penultimate will turn into "int 1 = 3" after C preprocessing,
|
|
since flex will put "#define foo 1" in the generated scanner.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
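
The clash and one way around it can be reproduced without flex. In this
sketch the two @code{#define} lines imitate what the reply says flex
emits for start conditions; the variable name @code{foo_value} is an
assumption chosen for the example.

@example
@verbatim
/* Imitates the definitions flex writes for "%x foo". */
#define INITIAL 0
#define foo 1

int bar(void)
{
    /* "int foo = 3;" would be preprocessed into "int 1 = 3;" here. */
    int foo_value = 3;   /* a name that is not also a start condition */

    return foo_value;
}
@end verbatim
@end example

Renaming the start condition instead (for example to an upper-case name
that no C identifier in the file uses) works just as well.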
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-73
|
|
@unnumberedsec unnamed-faq-73
|
|
@example
|
|
@verbatim
|
|
To: Maurice Petrie <mpetrie@infoscigroup.com>
|
|
Subject: Re: Lost flex .l file
|
|
In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
|
|
Date: Mon, 02 Feb 1998 11:15:12 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I am curious as to
|
|
> whether there is a simple way to backtrack from the generated source to
|
|
> reproduce the lost list of tokens we are searching on.
|
|
|
|
In theory, it's straight-forward to go from the DFA representation
|
|
back to a regular-expression representation - the two are isomorphic.
|
|
In practice, a huge headache, because you have to unpack all the tables
|
|
back into a single DFA representation, and then write a program to munch
|
|
on that and translate it into an RE.
|
|
|
|
Sorry for the less-than-happy news ...
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-74
|
|
@unnumberedsec unnamed-faq-74
|
|
@example
|
|
@verbatim
|
|
To: jimmey@lexis-nexis.com (Jimmey Todd)
|
|
Subject: Re: Flex performance question
|
|
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
|
|
Date: Thu, 19 Feb 1998 08:48:51 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> What I have found, is that the smaller the data chunk, the faster the
|
|
> program executes. This is the opposite of what I expected. Should this be
|
|
> happening this way?
|
|
|
|
This is exactly what will happen if your input file has embedded NULs.
|
|
From the man page:
|
|
|
|
A final note: flex is slow when matching NUL's, particularly
|
|
when a token contains multiple NUL's. It's best to write
|
|
rules which match short amounts of text if it's anticipated
|
|
that the text will often include NUL's.
|
|
|
|
So that's the first thing to look for.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-75
|
|
@unnumberedsec unnamed-faq-75
|
|
@example
|
|
@verbatim
|
|
To: jimmey@lexis-nexis.com (Jimmey Todd)
|
|
Subject: Re: Flex performance question
|
|
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
|
|
Date: Thu, 19 Feb 1998 15:42:25 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
So there are several problems.
|
|
|
|
First, to go fast, you want to match as much text as possible, which
|
|
your scanners don't in the case that what they're scanning is *not*
|
|
a <RN> tag. So you want a rule like:
|
|
|
|
[^<]+
|
|
|
|
Second, C++ scanners are particularly slow if they're interactive,
|
|
which they are by default. Using -B speeds it up by a factor of 3-4
|
|
on my workstation.
|
|
|
|
Third, C++ scanners that use the istream interface are slow, because
|
|
of how poorly implemented istream's are. I built two versions of
|
|
the following scanner:
|
|
|
|
%%
|
|
.*\n
|
|
.*
|
|
%%
|
|
|
|
and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
|
|
The C++ istream version, using -B, takes 3.8 seconds.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-76
|
|
@unnumberedsec unnamed-faq-76
|
|
@example
|
|
@verbatim
|
|
To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
|
|
Subject: Re: FLEX 2.5 & THE YEAR 2000
|
|
In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
|
|
Date: Wed, 03 Jun 1998 10:22:26 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I am researching the Y2K problem with General Electric R&D
|
|
> and need to know if there are any known issues concerning
|
|
> the above mentioned software and Y2K regardless of version.
|
|
|
|
There shouldn't be, all it ever does with the date is ask the system
|
|
for it and then print it out.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-77
|
|
@unnumberedsec unnamed-faq-77
|
|
@example
|
|
@verbatim
|
|
To: "Hans Dermot Doran" <htd@ibhdoran.com>
|
|
Subject: Re: flex problem
|
|
In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
|
|
Date: Tue, 21 Jul 1998 14:23:34 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> To overcome this, I gets() the stdin into a string and lex the string. The
|
|
> string is lexed OK except that the end of string isn't lexed properly
|
|
> (yy_scan_string()), that is the lexer dosn't recognise the end of string.
|
|
|
|
Flex doesn't contain mechanisms for recognizing buffer endpoints. But if
|
|
you use fgets instead (which you should anyway, to protect against buffer
|
|
overflows), then the final \n will be preserved in the string, and you can
|
|
scan that in order to find the end of the string.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
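
A sketch of the @code{fgets()} approach follows. The helper name
@code{read_line()} and its return convention are assumptions made for
this example; the comment at the end shows where @code{yy_scan_string()}
from a generated scanner would be used.

@example
@verbatim
#include <stdio.h>
#include <string.h>

/*
 * Read one line from fp into buf (capacity size).
 * Returns 1 if a complete line ending in '\n' was stored,
 * 0 if the line was cut short (buffer full, or EOF before a newline),
 * and -1 if nothing could be read.
 */
int read_line(FILE *fp, char *buf, size_t size)
{
    size_t len;

    if (size == 0 || fgets(buf, (int) size, fp) == NULL)
        return -1;
    len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}

/*
 * Typical use with a flex scanner (not compiled here):
 *
 *     char line[1024];
 *     if (read_line(stdin, line, sizeof line) == 1) {
 *         YY_BUFFER_STATE b = yy_scan_string(line);
 *         yylex();
 *         yy_delete_buffer(b);
 *     }
 */
@end verbatim
@end example

Because the trailing newline is kept, the scanner can treat @code{\n} as
the end-of-string marker with an ordinary rule.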
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-78
|
|
@unnumberedsec unnamed-faq-78
|
|
@example
|
|
@verbatim
|
|
To: soumen@almaden.ibm.com
|
|
Subject: Re: Flex++ 2.5.3 instance member vs. static member
|
|
In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
|
|
Date: Tue, 28 Jul 1998 01:10:34 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> %{
|
|
> int mylineno = 0;
|
|
> %}
|
|
> ws [ \t]+
|
|
> alpha [A-Za-z]
|
|
> dig [0-9]
|
|
> %%
|
|
>
|
|
> Now you'd expect mylineno to be a member of each instance of class
|
|
> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to
|
|
> indicate otherwise; unless I am missing something the declaration of
|
|
> mylineno seems to be outside any class scope.
|
|
>
|
|
> How will this work if I want to run a multi-threaded application with each
|
|
> thread creating a FlexLexer instance?
|
|
|
|
Derive your own subclass and make mylineno a member variable of it.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-79
|
|
@unnumberedsec unnamed-faq-79
|
|
@example
|
|
@verbatim
|
|
To: Adoram Rogel <adoram@hybridge.com>
|
|
Subject: Re: More than 32K states change hangs
|
|
In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
|
|
Date: Tue, 04 Aug 1998 22:28:45 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> Vern Paxson,
|
|
>
|
|
> I followed your advice, posted on Usenet bu you, and emailed to me
|
|
> personally by you, on how to overcome the 32K states limit. I'm running
|
|
> on Linux machines.
|
|
> I took the full source of version 2.5.4 and did the following changes in
|
|
> flexdef.h:
|
|
> #define JAMSTATE -327660
|
|
> #define MAXIMUM_MNS 319990
|
|
> #define BAD_SUBSCRIPT -327670
|
|
> #define MAX_SHORT 327000
|
|
>
|
|
> and compiled.
|
|
> All looked fine, including check and bigcheck, so I installed.
|
|
|
|
Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
|
|
archives I see that I did indeed recommend doing so. Try setting it back
|
|
to 32700; that should suffice that you no longer need -Ca. If it still
|
|
hangs, then the interesting question is - where?
|
|
|
|
> Compiling the same hanged program with a out-of-the-box (RedHat 4.2
|
|
> distribution of Linux)
|
|
> flex 2.5.4 binary works.
|
|
|
|
Since Linux comes with source code, you should diff it against what
|
|
you have to see what problems they missed.
|
|
|
|
> Should I always compile with the -Ca option now ? even short and simple
|
|
> filters ?
|
|
|
|
No, definitely not. It's meant to be for those situations where you
|
|
absolutely must squeeze every last cycle out of your scanner.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-80
|
|
@unnumberedsec unnamed-faq-80
|
|
@example
|
|
@verbatim
|
|
To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
|
|
Subject: Re: flex output for static code portion
|
|
In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
|
|
Date: Mon, 17 Aug 1998 23:57:42 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I would like to use flex under the hood to generate a binary file
|
|
> containing the data structures that control the parse.
|
|
|
|
This has been on the wish-list for a long time. In principle it's
|
|
straight-forward - you redirect mkdata() et al's I/O to another file,
|
|
and modify the skeleton to have a start-up function that slurps these
|
|
into dynamic arrays. The concerns are (1) the scanner generation code
|
|
is hairy and full of corner cases, so it's easy to get surprised when
|
|
going down this path :-( ; and (2) being careful about buffering so
|
|
that when the tables change you make sure the scanner starts in the
|
|
correct state and reading at the right point in the input file.
|
|
|
|
> I was wondering if you know of anyone who has used flex in this way.
|
|
|
|
I don't - but it seems like a reasonable project to undertake (unlike
|
|
numerous other flex tweaks :-).
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-81
|
|
@unnumberedsec unnamed-faq-81
|
|
@example
|
|
@verbatim
|
|
From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
|
|
Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de>
|
|
Subject: "flex scanner push-back overflow"
|
|
To: vern@ee.lbl.gov
|
|
Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
|
|
Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
|
|
|
|
Hi Vern,
|
|
|
|
Yesterday, I encountered a strange problem: I use the macro processor m4
|
|
to include some lengthy lists into a .l file. Following is a flex macro
|
|
definition that causes some serious pain in my neck:
|
|
|
|
AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])
|
|
|
|
The complete list contains about 10kB. When I try to "flex" this file
|
|
(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
|
|
some of the predefined values in flexdefs.h) I get the error:
|
|
|
|
myflex/flex -8 sentag.tmp.l
|
|
flex scanner push-back overflow
|
|
|
|
When I remove the slashes in the macro definition everything works fine.
|
|
As I understand it, the double quotes escape the slash-character so it
|
|
really means "/" and not "trailing context". Furthermore, I tried to
|
|
escape the slashes with backslashes, but with no use, the same error message
|
|
appeared when flexing the code.
|
|
|
|
Do you have an idea what's going on here?
|
|
|
|
Greetings from Germany,
|
|
Georg
|
|
--
|
|
Georg Rehm georg@cl-ki.uni-osnabrueck.de
|
|
Institute for Semantic Information Processing, University of Osnabrueck, FRG
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-82
|
|
@unnumberedsec unnamed-faq-82
|
|
@example
|
|
@verbatim
|
|
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
|
|
Subject: Re: "flex scanner push-back overflow"
|
|
In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT.
|
|
Date: Thu, 20 Aug 1998 07:05:35 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> myflex/flex -8 sentag.tmp.l
|
|
> flex scanner push-back overflow
|
|
|
|
Flex itself uses a flex scanner. That scanner is running out of buffer
|
|
space when it tries to unput() the humongous macro you've defined. When
|
|
you remove the '/'s, you make it small enough so that it fits in the buffer;
|
|
removing spaces would do the same thing.
|
|
|
|
The fix is to either rethink how come you're using such a big macro and
|
|
perhaps there's another/better way to do it; or to rebuild flex's own
|
|
scan.c with a larger value for
|
|
|
|
#define YY_BUF_SIZE 16384
|
|
|
|
- Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-83
|
|
@unnumberedsec unnamed-faq-83
|
|
@example
|
|
@verbatim
|
|
To: Jan Kort <jan@research.techforce.nl>
|
|
Subject: Re: Flex
|
|
In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200.
|
|
Date: Sat, 05 Sep 1998 00:59:49 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> %%
|
|
>
|
|
> "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); }
|
|
> ^\n { fprintf(stderr, "empty line\n"); }
|
|
> . { }
|
|
> \n { fprintf(stderr, "new line\n"); }
|
|
>
|
|
> %%
|
|
> -- input ---------------------------------------
|
|
> TEST1
|
|
> -- output --------------------------------------
|
|
> TEST1
|
|
> empty line
|
|
> ------------------------------------------------
|
|
|
|
IMHO, it's not clear whether or not this is in fact a bug. It depends
|
|
on whether you view yyless() as backing up in the input stream, or as
|
|
pushing new characters onto the beginning of the input stream. Flex
|
|
interprets it as the latter (for implementation convenience, I'll admit),
|
|
and so considers the newline as in fact matching at the beginning of a
|
|
line, as after all the last token scanned an entire line and so the
|
|
scanner is now at the beginning of a new line.
|
|
|
|
I agree that this is counter-intuitive for yyless(), given its
|
|
functional description (it's less so for unput(), depending on whether
|
|
you're unput()'ing new text or scanned text). But I don't plan to
|
|
change it any time soon, as it's a pain to do so. Consequently,
|
|
you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak
|
|
your scanner into the behavior you desire.
|
|
|
|
Sorry for the less-than-completely-satisfactory answer.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-84
|
|
@unnumberedsec unnamed-faq-84
|
|
@example
|
|
@verbatim
|
|
To: Patrick Krusenotto <krusenot@mac-info-link.de>
|
|
Subject: Re: Problems with restarting flex-2.5.2-generated scanner
|
|
In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT.
|
|
Date: Thu, 24 Sep 1998 23:28:43 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately
|
|
> trying to make my scanner restart with a new file after my parser stops
|
|
> with a parse error. When my compiler restarts, the parser always
|
|
> receives the token after the token (in the old file!) that caused the
|
|
> parser error.
|
|
|
|
I suspect the problem is that your parser has read ahead in order
|
|
to attempt to resolve an ambiguity, and when it's restarted it picks
|
|
up with that token rather than reading a fresh one. If you're using
|
|
yacc, then the special "error" production can sometimes be used to
|
|
consume tokens in an attempt to get the parser into a consistent state.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-85
|
|
@unnumberedsec unnamed-faq-85
|
|
@example
|
|
@verbatim
|
|
To: Henric Jungheim <junghelh@pe-nelson.com>
|
|
Subject: Re: flex 2.5.4a
|
|
In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
|
|
Date: Tue, 27 Oct 1998 16:50:14 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> This brings up a feature request: How about a command line
|
|
> option to specify the filename when reading from stdin? That way one
|
|
> doesn't need to create a temporary file in order to get the "#line"
|
|
> directives to make sense.
|
|
|
|
Use -o combined with -t (per the man page description of -o).
|
|
|
|
> P.S., Is there any simple way to use non-blocking IO to parse multiple
|
|
> streams?
|
|
|
|
Simple, no.
|
|
|
|
One approach might be to return a magic character on EWOULDBLOCK and
|
|
have a rule
|
|
|
|
.*<magic-character> // put back .*, eat magic character
|
|
|
|
This is off the top of my head, not sure it'll work.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-86
|
|
@unnumberedsec unnamed-faq-86
|
|
@example
|
|
@verbatim
|
|
To: "Repko, Billy D" <billy.d.repko@intel.com>
|
|
Subject: Re: Compiling scanners
|
|
In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
|
|
Date: Thu, 14 Jan 1999 00:25:30 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> It appears that maybe it cannot find the lfl library.
|
|
|
|
The Makefile in the distribution builds it, so you should have it.
|
|
It's exceedingly trivial, just a main() that calls yylex() and
|
|
a yywrap() that always returns 1.
|
|
|
|
> %%
|
|
> \n ++num_lines; ++num_chars;
|
|
> . ++num_chars;
|
|
|
|
You can't indent your rules like this - that's where the errors are coming
|
|
from. Flex copies indented text to the output file, it's how you do things
|
|
like
|
|
|
|
int num_lines_seen = 0;
|
|
|
|
to declare local variables.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-87
|
|
@unnumberedsec unnamed-faq-87
|
|
@example
|
|
@verbatim
|
|
To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
|
|
Subject: Re: flex input buffer
|
|
In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
|
|
Date: Tue, 09 Feb 1999 21:03:37 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> In the flex.skl file the size of the default input buffers is set. Can you
|
|
> explain why this size is set and why it is such a high number.
|
|
|
|
It's large to optimize performance when scanning large files. You can
|
|
safely make it a lot lower if needed.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-88
|
|
@unnumberedsec unnamed-faq-88
|
|
@example
|
|
@verbatim
|
|
To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
|
|
Subject: Re: Flex error message
|
|
In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
|
|
Date: Thu, 25 Feb 1999 00:11:31 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> I'm extending a larger scanner written in Flex and I keep running into
|
|
> problems. More specifically, I get the error message:
|
|
> "flex: input rules are too complicated (>= 32000 NFA states)"
|
|
|
|
Increase the definitions in flexdef.h for:
|
|
|
|
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
|
|
#define MAXIMUM_MNS 31999
|
|
#define BAD_SUBSCRIPT -32767
|
|
|
|
recompile everything, and it should all work.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-90
|
|
@unnumberedsec unnamed-faq-90
|
|
@example
|
|
@verbatim
|
|
To: "Dmitriy Goldobin" <gold@ems.chel.su>
|
|
Subject: Re: FLEX trouble
|
|
In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
|
|
Date: Tue, 01 Jun 1999 00:15:07 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
|
|
> but rule "/*"(.|\n)*"*/" don't work ?
|
|
|
|
The second of these will have to scan the entire input stream (because
|
|
"(.|\n)*" matches an arbitrary amount of any text) in order to see if
|
|
it ends with "*/", terminating the comment. That potentially will overflow
|
|
the input buffer.
|
|
|
|
> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
|
|
> 'unrecognized rule'.
|
|
|
|
You can't use the '/' operator inside parentheses. It's not clear
|
|
what "(a/b)*" actually means.
|
|
|
|
> I now use workaround with state <comment>, but single-rule is
|
|
> better, i think.
|
|
|
|
Single-rule is nice but will always have the problem of either setting
|
|
restrictions on comments (like not allowing multi-line comments) and/or
|
|
running the risk of consuming the entire input stream, as noted above.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
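
For reference, a commonly used single-rule pattern for C comments avoids
both @code{.*} and the @code{/} operator; in flex syntax it can be written
@code{"/*"([^*]|\*+[^*/])*\*+"/"}. It does not come from the reply above,
and it still has to hold an entire comment in the buffer, so the caveat
about long comments applies. The sketch below checks the equivalent POSIX
extended regular expression with @code{regcomp()}; the function name
@code{comment_length()} is an assumption made for this example.

@example
@verbatim
#include <regex.h>
#include <stddef.h>

/*
 * If text starts with a complete C comment, return the length of that
 * comment in characters; otherwise return 0.
 */
size_t comment_length(const char *text)
{
    /* Opening delimiter, then any run of non-stars or of stars that
     * are not followed by a slash, then the closing stars and slash. */
    static const char pattern[] = "^/\\*([^*]|\\*+[^*/])*\\*+/";
    regex_t re;
    regmatch_t m;
    size_t len = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (size_t) (m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
@end verbatim
@end example

The match stops at the first closing delimiter, so a second comment later
in the input is not swallowed, and an unterminated comment does not match
at all.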
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-91
|
|
@unnumberedsec unnamed-faq-91
|
|
@example
|
|
@verbatim
|
|
To: vern@ee.lbl.gov
|
|
Date: Tue, 15 Jun 1999 08:55:43 -0700
|
|
From: "Aki Niimura" <neko@my-deja.com>
|
|
Message-ID: <KNONDOHDOBGAEAAA@my-deja.com>
|
|
Subject: A question on flex C++ scanner
|
|
|
|
Dear Dr. Paxon,
|
|
|
|
I have been using flex for years.
|
|
It works very well on many projects.
|
|
Most case, I used it to generate a scanner on C language.
|
|
However, one project I needed to generate a scanner
|
|
on C++ language. Thanks to your enhancement, flex did
|
|
the job.
|
|
|
|
Currently, I'm working on enhancing my previous project.
|
|
I need to deal with multiple input streams (recursive
|
|
inclusion) in this scanner (C++).
|
|
I did similar thing for another scanner (C) as you
|
|
explained in your documentation.
|
|
|
|
The generated scanner (C++) has necessary methods:
|
|
- switch_to_buffer(struct yy_buffer_state *b)
|
|
- yy_create_buffer(istream *is, int sz)
|
|
- yy_delete_buffer(struct yy_buffer_state *b)
|
|
|
|
However, I couldn't figure out how to access current
|
|
buffer (yy_current_buffer).
|
|
|
|
yy_current_buffer is a protected member of yyFlexLexer.
|
|
I can't access it directly.
|
|
Then, I thought yy_create_buffer() with is = 0 might
|
|
return current stream buffer. But it seems not as far
|
|
as I checked the source. (flex 2.5.4)
|
|
|
|
I went through the Web in addition to Flex documentation.
|
|
However, it hasn't been successful, so far.
|
|
|
|
It is not my intention to bother you, but, can you
|
|
comment about how to obtain the current stream buffer?
|
|
|
|
Your response would be highly appreciated.
|
|
|
|
Best regards,
|
|
Aki Niimura
|
|
|
|
--== Sent via Deja.com http://www.deja.com/ ==--
|
|
Share what you know. Learn what you don't.
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-92
|
|
@unnumberedsec unnamed-faq-92
|
|
@example
|
|
@verbatim
|
|
To: neko@my-deja.com
|
|
Subject: Re: A question on flex C++ scanner
|
|
In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
|
|
Date: Tue, 15 Jun 1999 09:04:24 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> However, I couldn't figure out how to access current
|
|
> buffer (yy_current_buffer).
|
|
|
|
Derive your own subclass from yyFlexLexer.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-93
|
|
@unnumberedsec unnamed-faq-93
|
|
@example
|
|
@verbatim
|
|
To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
|
|
Subject: Re: You're the man to see?
|
|
In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
|
|
Date: Wed, 23 Jun 1999 09:01:40 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I hope you can help me. I am using Flex and Bison to produce an interpreted
|
|
> language. However all goes well until I try to implement an IF statement or
|
|
> a WHILE. I cannot get this to work as the parser parses all the conditions
|
|
> eg. the TRUE and FALSE conditions to check for a rule match. So I cannot
|
|
> make a decision!!
|
|
|
|
You need to use the parser to build a parse tree (= abstract syntax tree),
|
|
and when that's all done you recursively evaluate the tree, binding variables
|
|
to values at that time.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
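
The approach described in the reply above can be sketched in C. This is only
an illustration under stated assumptions: the node kinds, the @code{mk}
constructor, and the @code{eval} function are hypothetical names, not part of
flex or bison. The parser actions would build the tree with @code{mk}, and
@code{eval} walks it once parsing is finished, so only the selected branch of
an IF is evaluated.

@example
@verbatim
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node kinds for a tiny expression language. */
enum kind { K_NUM, K_ADD, K_LESS, K_IF };

struct node {
    enum kind kind;
    int value;               /* used by K_NUM only */
    struct node *a, *b, *c;  /* K_IF: a = condition, b = then, c = else */
};

/* Allocate one tree node; a parser action would call this. */
struct node *mk(enum kind kind, int value,
                struct node *a, struct node *b, struct node *c)
{
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    n->value = value;
    n->a = a;
    n->b = b;
    n->c = c;
    return n;
}

/* Recursively evaluate the finished tree.  For K_IF the condition is
   evaluated first and only one of the two branches is visited. */
int eval(const struct node *n)
{
    switch (n->kind) {
    case K_NUM:  return n->value;
    case K_ADD:  return eval(n->a) + eval(n->b);
    case K_LESS: return eval(n->a) < eval(n->b);
    case K_IF:   return eval(n->a) ? eval(n->b) : eval(n->c);
    }
    return 0;
}
@end verbatim
@end example

A WHILE node could be handled the same way, by re-evaluating the condition
subtree in a loop inside @code{eval}.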
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-94
|
|
@unnumberedsec unnamed-faq-94
|
|
@example
|
|
@verbatim
|
|
To: Petr Danecek <petr@ics.cas.cz>
|
|
Subject: Re: flex - question
|
|
In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
|
|
Date: Fri, 02 Jul 1999 16:52:13 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> file, it takes an enormous amount of time. It is funny, because the
|
|
> source code has only 12 rules!!! I think it looks like an exponential
|
|
> growth.
|
|
|
|
Right, that's the problem - some patterns (those with a lot of
|
|
ambiguity, which yours has because at any given time the scanner can
|
|
be in the middle of all sorts of combinations of the different
|
|
rules) blow up exponentially.
|
|
|
|
For your rules, there is an easy fix. Change the ".*" that comes after
|
|
the directory name to "[^ ]*". With that in place, the rules are no
|
|
longer nearly so ambiguous, because then once one of the directories
|
|
has been matched, no other can be matched (since they all require a
|
|
leading blank).
|
|
|
|
If that's not an acceptable solution, then you can enter a start state
|
|
to pick up the .*\n after each directory is matched.
|
|
|
|
Also note that for speed, you'll want to add a ".*" rule at the end,
|
|
otherwise lines that don't match any of the patterns will be matched
|
|
very slowly, a character at a time.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
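
The start-state alternative mentioned in the reply above might look like the
following sketch. The original rules are not shown in the message, so the
directory names, the counters, and the @code{REST} start condition are all
hypothetical. The sketch also assumes every input line ends with a newline.

@example
@verbatim
%option noyywrap
%x REST
%{
#include <stdio.h>
static int n_bin, n_lib;   /* hypothetical per-directory counters */
%}
%%
" /usr/bin"   { ++n_bin; BEGIN(REST); }
" /usr/lib"   { ++n_lib; BEGIN(REST); }
<REST>.*\n    { BEGIN(INITIAL); }  /* swallow the rest of the line */
[^ \n]+       ;                    /* skip other text in bulk */
" "|\n        ;
%%
int main(void)
{
    yylex();
    printf("bin=%d lib=%d\n", n_bin, n_lib);
    return 0;
}
@end verbatim
@end example

Because the trailing @samp{.*} is no longer part of each directory rule, the
rules stop overlapping with one another, which is what keeps the generated
tables small.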
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-95
|
|
@unnumberedsec unnamed-faq-95
|
|
@example
|
|
@verbatim
|
|
To: Tielman Koekemoer <tielman@spi.co.za>
|
|
Subject: Re: Please help.
|
|
In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
|
|
Date: Thu, 08 Jul 1999 08:20:39 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I was hoping you could help me with my problem.
|
|
>
|
|
> I tried compiling (gnu)flex on a Solaris 2.4 machine
|
|
> but when I ran make (after configure) I got an error.
|
|
>
|
|
> --------------------------------------------------------------
|
|
> gcc -c -I. -I. -g -O parse.c
|
|
> ./flex -t -p ./scan.l >scan.c
|
|
> sh: ./flex: not found
|
|
> *** Error code 1
|
|
> make: Fatal error: Command failed for target `scan.c'
|
|
> -------------------------------------------------------------
|
|
>
|
|
> What's strange to me is that I'm only
|
|
> trying to install flex now. I then edited the Makefile to
|
|
> and changed where it says "FLEX = flex" to "FLEX = lex"
|
|
> ( lex: the native Solaris one ) but then it complains about
|
|
> the "-p" option. Is there any way I can compile flex without
|
|
> using flex or lex?
|
|
>
|
|
> Thanks so much for your time.
|
|
|
|
You managed to step on the bootstrap sequence, which first copies
|
|
initscan.c to scan.c in order to build flex. Try fetching a fresh
|
|
distribution from ftp.ee.lbl.gov. (Or you can first try removing
|
|
".bootstrap" and doing a make again.)
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-96
|
|
@unnumberedsec unnamed-faq-96
|
|
@example
|
|
@verbatim
|
|
To: Tielman Koekemoer <tielman@spi.co.za>
|
|
Subject: Re: Please help.
|
|
In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
|
|
Date: Fri, 09 Jul 1999 00:27:20 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> First I removed .bootstrap (and ran make) - no luck. I downloaded the
|
|
> software but I still have the same problem. Is there anything else I
|
|
> could try.
|
|
|
|
Try:
|
|
|
|
cp initscan.c scan.c
|
|
touch scan.c
|
|
make scan.o
|
|
|
|
If this last tries to first build scan.c from scan.l using ./flex, then
|
|
your "make" is broken, in which case compile scan.c to scan.o by hand.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-97
|
|
@unnumberedsec unnamed-faq-97
|
|
@example
|
|
@verbatim
|
|
To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
|
|
Subject: Re: Error
|
|
In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
|
|
Date: Tue, 20 Jul 1999 00:18:26 PDT
|
|
From: Vern Paxson <vern>
|
|
|
|
> I am getting a compilation error. The error is given as "unknown symbol- yylex".
|
|
|
|
The parser relies on calling yylex(), but you're instead using the C++ scanning
|
|
class, so you need to supply a yylex() "glue" function that calls an instance
|
|
of the scanner (e.g., "scanner->yylex()").
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-98
|
|
@unnumberedsec unnamed-faq-98
|
|
@example
|
|
@verbatim
|
|
To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
|
|
Subject: Re: lex
|
|
In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
|
|
Date: Tue, 23 Nov 1999 15:54:30 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
Well, your problem is the
|
|
|
|
switch (yybgin-yysvec-1) { /* witchcraft */
|
|
|
|
at the beginning of lex rules. "witchcraft" == "non-portable". It's
|
|
assuming knowledge of the AT&T lex's internal variables.
|
|
|
|
For flex, you can probably do the equivalent using a switch on YYSTATE.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
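
As a rough sketch of the suggestion above, code placed at the top of the rules
section runs each time @code{yylex} is entered and can switch on
@code{YYSTATE}. The @code{STR} start condition and the messages are
hypothetical; the original program is not shown in the message, so this only
illustrates the mechanism.

@example
@verbatim
%option noyywrap
%x STR
%{
#include <stdio.h>
%}
%%
%{
    /* This block runs each time yylex() is entered. */
    switch (YYSTATE) {
    case STR:
        fputs("resuming inside a string\n", stderr);
        break;
    default:
        break;
    }
%}
\"              { BEGIN(STR); }
<STR>[^"\n]+    { return 1; }        /* hand string text to the caller */
<STR>\"|\n      { BEGIN(INITIAL); }
.|\n            ;
%%
int main(void)
{
    while (yylex() != 0)
        ;
    return 0;
}
@end verbatim
@end example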
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-99
|
|
@unnumberedsec unnamed-faq-99
|
|
@example
|
|
@verbatim
|
|
To: archow@hss.hns.com
|
|
Subject: Re: Regarding distribution of flex and yacc based grammars
|
|
In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
|
|
Date: Wed, 22 Dec 1999 01:56:24 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> When we provide the customer with an object code distribution, is it
|
|
> necessary for us to provide source
|
|
> for the generated C files from flex and bison since they are generated by
|
|
> flex and bison ?
|
|
|
|
For flex, no. I don't know what the current state of this is for bison.
|
|
|
|
> Also, is there any requirement for us to necessarily provide source for
|
|
> the grammar files which are fed into flex and bison ?
|
|
|
|
Again, for flex, no.
|
|
|
|
See the file "COPYING" in the flex distribution for the legalese.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-100
|
|
@unnumberedsec unnamed-faq-100
|
|
@example
|
|
@verbatim
|
|
To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
|
|
Subject: Re: Flex, and self referencing rules
|
|
In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
|
|
Date: Sat, 19 Feb 2000 18:33:16 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
> However, I do not use unput anywhere. I do use self-referencing
|
|
> rules like this:
|
|
>
|
|
> UnaryExpr ({UnionExpr})|("-"{UnaryExpr})
|
|
|
|
You can't do this - flex is *not* a parser like yacc (which does indeed
|
|
allow recursion), it is a scanner that's confined to regular expressions.
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
|
|
|
|
@c TODO: Evaluate this faq.
|
|
@node unnamed-faq-101
|
|
@unnumberedsec unnamed-faq-101
|
|
@example
|
|
@verbatim
|
|
To: slg3@lehigh.edu (SAMUEL L. GULDEN)
|
|
Subject: Re: Flex problem
|
|
In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
|
|
Date: Thu, 02 Mar 2000 23:00:46 PST
|
|
From: Vern Paxson <vern>
|
|
|
|
If this is exactly your program:
|
|
|
|
> digit [0-9]
|
|
> digits {digit}+
|
|
> whitespace [ \t\n]+
|
|
>
|
|
> %%
|
|
> "[" { printf("open_brac\n");}
|
|
> "]" { printf("close_brac\n");}
|
|
> "+" { printf("addop\n");}
|
|
> "*" { printf("multop\n");}
|
|
> {digits} { printf("NUMBER = %s\n", yytext);}
|
|
> whitespace ;
|
|
|
|
then the problem is that the last rule needs to be "{whitespace}" !
|
|
|
|
Vern
|
|
@end verbatim
|
|
@end example
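
For reference, here is the quoted program with the last rule corrected to
@code{@{whitespace@}}. The @samp{%option noyywrap} line, the @code{#include},
and the @code{main} function are additions made here so that the sketch builds
on its own; they are not part of the original message.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}
digit       [0-9]
digits      {digit}+
whitespace  [ \t\n]+
%%
"["           { printf("open_brac\n"); }
"]"           { printf("close_brac\n"); }
"+"           { printf("addop\n"); }
"*"           { printf("multop\n"); }
{digits}      { printf("NUMBER = %s\n", yytext); }
{whitespace}  ;
%%
int main(void)
{
    return yylex();
}
@end verbatim
@end example

Without the braces, the pattern is the literal string @samp{whitespace}, so
blanks and newlines fall through to the default rule and are echoed.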
|
|
|
|
@node What is the difference between YYLEX_PARAM and YY_DECL?
|
|
@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?
|
|
|
|
YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra
|
|
parameters when it calls yylex() from the parser.
|
|
|
|
YY_DECL is the Flex declaration of yylex. The default is similar to this:
|
|
|
|
@example
|
|
@verbatim
|
|
#define YY_DECL int yylex ()
|
|
@end verbatim
|
|
@end example
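
As a sketch of how @code{YY_DECL} can be overridden, the scanner below gives
@code{yylex} an extra parameter. The @code{ntokens} parameter and the token
code @code{1} are hypothetical and chosen only for illustration.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
/* Hypothetical signature: yylex() updates a caller-supplied counter. */
#define YY_DECL int yylex (int *ntokens)
%}
%%
[[:alnum:]]+   { ++*ntokens; return 1; }
.|\n           ;
%%
int main(void)
{
    int ntokens = 0;
    while (yylex(&ntokens) != 0)
        ;
    printf("%d\n", ntokens);
    return 0;
}
@end verbatim
@end example

If such a scanner is called from a Bison parser, the parser must be told to
pass the matching argument, which is the job of @code{YYLEX_PARAM}.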
|
|
|
|
|
|
@node Why do I get "conflicting types for yylex" error?
|
|
@unnumberedsec Why do I get "conflicting types for yylex" error?
|
|
|
|
This is a compiler error regarding a generated Bison parser, not a Flex scanner.
|
|
It means you need a prototype of yylex() at the top of the Bison file.
|
|
Be sure the prototype matches YY_DECL.
|
|
|
|
@node How do I access the values set in a Flex action from within a Bison action?
|
|
@unnumberedsec How do I access the values set in a Flex action from within a Bison action?
|
|
|
|
With $1, $2, $3, etc. These are called "Semantic Values" in the Bison manual.
|
|
See @ref{Top, , , bison, the GNU Bison Manual}.
|
|
|
|
@node Appendices, Indices, FAQ, Top
|
|
@appendix Appendices
|
|
|
|
@menu
|
|
* Makefiles and Flex::
|
|
* Bison Bridge::
|
|
* M4 Dependency::
|
|
* Common Patterns::
|
|
@end menu
|
|
|
|
@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
|
|
@appendixsec Makefiles and Flex
|
|
|
|
@cindex Makefile, syntax
|
|
|
|
In this appendix, we provide tips for writing Makefiles to build your scanners.
|
|
|
|
In a traditional build environment, we say that the @file{.c} files are the
|
|
sources, and the @file{.o} files are the intermediate files. When using
|
|
@code{flex}, however, the @file{.l} files are the sources, and the generated
|
|
@file{.c} files (along with the @file{.o} files) are the intermediate files.
|
|
This requires you to carefully plan your Makefile.
|
|
|
|
Modern @command{make} programs understand that @file{foo.l} is intended to
|
|
generate @file{lex.yy.c} or @file{foo.c}, and will behave
|
|
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
|
|
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
|
|
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
|
|
then you should provide an explicit rule in your Makefile.am}. The
|
|
following Makefile does not explicitly instruct @command{make} how to build
|
|
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
|
|
@command{make} program to build the intermediate file, @file{scan.c}:
|
|
|
|
@cindex Makefile, example of implicit rules
|
|
@example
|
|
@verbatim
|
|
# Basic Makefile -- relies on implicit rules
|
|
# Creates "myprogram" from "scan.l" and "myprogram.c"
|
|
#
|
|
LEX=flex
|
|
myprogram: scan.o myprogram.o
|
|
scan.o: scan.l
|
|
|
|
@end verbatim
|
|
@end example
|
|
|
|
|
|
For simple cases, the above may be sufficient. For other cases,
|
|
you may have to explicitly instruct @command{make} how to build your scanner.
|
|
The following is an example of a Makefile containing explicit rules:
|
|
|
|
@cindex Makefile, explicit example
|
|
@example
|
|
@verbatim
|
|
# Basic Makefile -- provides explicit rules
|
|
# Creates "myprogram" from "scan.l" and "myprogram.c"
|
|
#
|
|
LEX=flex
|
|
myprogram: scan.o myprogram.o
|
|
$(CC) -o $@ $(LDFLAGS) $^
|
|
|
|
myprogram.o: myprogram.c
|
|
$(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^
|
|
|
|
scan.o: scan.c
|
|
$(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^
|
|
|
|
scan.c: scan.l
|
|
$(LEX) $(LFLAGS) -o $@ $^
|
|
|
|
clean:
|
|
$(RM) *.o scan.c
|
|
|
|
@end verbatim
|
|
@end example
|
|
|
|
Notice in the above example that @file{scan.c} is in the @code{clean} target.
|
|
This is because we consider the file @file{scan.c} to be an intermediate file.
|
|
|
|
Finally, we provide a realistic example of a @code{flex} scanner used with a
|
|
@code{bison} parser@footnote{This example also applies to yacc parsers.}.
|
|
There is a tricky problem we have to deal with. Since a @code{flex} scanner
|
|
will typically include a header file (e.g., @file{y.tab.h}) generated by the
|
|
parser, we need to be sure that the header file is generated BEFORE the scanner
|
|
is compiled. We handle this case in the following example:
|
|
|
|
@example
|
|
@verbatim
|
|
# Makefile example -- scanner and parser.
|
|
# Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
|
|
#
|
|
LEX = flex
|
|
YACC = bison -y
|
|
YFLAGS = -d
|
|
objects = scan.o parse.o myprogram.o
|
|
|
|
myprogram: $(objects)
|
|
scan.o: scan.l parse.c
|
|
parse.o: parse.y
|
|
myprogram.o: myprogram.c
|
|
|
|
@end verbatim
|
|
@end example
|
|
|
|
In the above example, notice the following line, which lists the file
@file{parse.c} (the generated parser) as a dependency of @file{scan.o}:

@example
@verbatim
    scan.o: scan.l parse.c
@end verbatim
@end example

We want to ensure that the parser is created before the scanner is compiled.
Because @command{make} builds the prerequisites of a target before the target
itself, listing @file{parse.c} here causes the parser, and with it the header
file, to be generated first. Implicit rules differ between implementations of
@command{make}, so verify this behavior with the one you use.
|
|
|
|
|
|
For more details on writing Makefiles, see @ref{Top, , , make, The
|
|
GNU Make Manual}.
|
|
|
|
@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
|
|
@section C Scanners with Bison Parsers
|
|
|
|
@cindex bison, bridging with flex
|
|
@vindex yylval
|
|
@vindex yylloc
|
|
@tindex YYLTYPE
|
|
@tindex YYSTYPE
|
|
|
|
This section describes the @code{flex} features useful when integrating
|
|
@code{flex} with @code{GNU bison}@footnote{The features described here are
|
|
purely optional, and are by no means the only way to use flex with bison.
|
|
We merely provide some glue to ease development of your parser-scanner pair.}.
|
|
Skip this section if you are not using
|
|
@code{bison} with your scanner. Here we discuss only the @code{flex}
|
|
half of the @code{flex} and @code{bison} pair. We do not discuss
|
|
@code{bison} in any detail. For more information about generating
|
|
@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.
|
|
|
|
A @code{bison}-compatible scanner is generated by declaring @samp{%option
|
|
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
|
|
from the command line. This instructs @code{flex} that the macro
|
|
@code{yylval} may be used. The data type for
|
|
@code{yylval}, @code{YYSTYPE},
|
|
is typically defined in a header file, included in section 1 of the
|
|
@code{flex} input file. For a list of functions and macros
|
|
available, see @ref{bison-functions}.
|
|
|
|
The declaration of yylex becomes,
|
|
|
|
@findex yylex (reentrant version)
|
|
@example
|
|
@verbatim
|
|
int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
|
|
@end verbatim
|
|
@end example
|
|
|
|
If @code{%option bison-locations} is specified, then the declaration
|
|
becomes,
|
|
|
|
@findex yylex (reentrant version)
|
|
@example
|
|
@verbatim
|
|
int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
|
|
@end verbatim
|
|
@end example
|
|
|
|
Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
|
|
Support for @code{yylloc} is optional in @code{bison}, so it is optional in
|
|
@code{flex} as well. The following is an example of a @code{flex} scanner that
|
|
is compatible with @code{bison}.
|
|
|
|
@cindex bison, scanner to be called from bison
|
|
@example
|
|
@verbatim
|
|
/* Scanner for "C" assignment statements... sort of. */
|
|
%{
|
|
#include <stdlib.h>  /* atoi */
#include <string.h>  /* strdup */
#include "y.tab.h"  /* Generated by bison. */
|
|
%}
|
|
|
|
%option bison-bridge bison-locations
|
|
%%
|
|
|
|
[[:digit:]]+ { yylval->num = atoi(yytext); return NUMBER;}
|
|
[[:alnum:]]+ { yylval->str = strdup(yytext); return STRING;}
|
|
"="|";" { return yytext[0];}
|
|
. {}
|
|
%%
|
|
@end verbatim
|
|
@end example
|
|
|
|
As you can see, there really is no magic here. We just use
|
|
@code{yylval} as we would any other variable. The data type of
|
|
@code{yylval} is generated by @code{bison}, and included in the file
|
|
@file{y.tab.h}. Here is the corresponding @code{bison} parser:
|
|
|
|
@cindex bison, parser
|
|
@example
|
|
@verbatim
|
|
/* Parser to convert "C" assignments to lisp. */
|
|
%{
|
|
/* Pass the argument to yyparse through to yylex. */
|
|
#define YYPARSE_PARAM scanner
|
|
#define YYLEX_PARAM scanner
|
|
%}
|
|
%locations
|
|
%pure_parser
|
|
%union {
|
|
int num;
|
|
char* str;
|
|
}
|
|
%token <str> STRING
|
|
%token <num> NUMBER
|
|
%%
|
|
assignment:
|
|
STRING '=' NUMBER ';' {
|
|
printf( "(setf %s %d)", $1, $3 );
|
|
}
|
|
;
|
|
@end verbatim
|
|
@end example
|
|
|
|
@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
|
|
@section M4 Dependency
|
|
@cindex m4
|
|
The macro processor @code{m4}@footnote{The use of m4 is subject to change in
|
|
future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
|
|
must be installed wherever flex is installed.
|
|
@code{flex} invokes @samp{m4}, found by searching the directories in the
|
|
@code{PATH} environment variable. Any code you place in section 1 or in the
|
|
actions will be sent through m4. Please follow these rules to protect your
|
|
code from unwanted @code{m4} processing.
|
|
|
|
@itemize
|
|
|
|
@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
|
|
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
|
|
some reason you need m4_ as a prefix, use a preprocessor #define to get your
|
|
symbol past m4 unmangled.
|
|
|
|
@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
|
|
former is not valid in C, except within comments and strings, but the latter is valid in
|
|
code such as @code{x[y[z]]}. The solution is simple. To get the literal string
|
|
@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
|
|
use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
|
|
escape them. However, it's best to avoid this complexity where possible, by
|
|
removing such sequences from your code.
|
|
|
|
@end itemize
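
The two workarounds in the second rule can be checked with a small piece of
C. This is only a sketch: the function names @code{nested_lookup} and
@code{close_brackets} are hypothetical. Note that the code avoids two adjacent
closing brackets even in its comments.

@example
@verbatim
#include <assert.h>
#include <stddef.h>

/* Nested indexing written with a space between the closing brackets,
   so the m4-sensitive sequence never appears in the source. */
int nested_lookup(const int *x, const int *y, int z)
{
    assert(x != NULL && y != NULL);
    return x[y[z] ];
}

/* Adjacent string literals are concatenated by the compiler, so this
   returns a two-character string made of two closing brackets. */
const char *close_brackets(void)
{
    return "]" "]";
}
@end verbatim
@end example

Both forms compile to exactly what the unspaced versions would, so the only
cost is a slightly unusual appearance in the source.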
|
|
|
|
@code{m4} is only required at the time you run @code{flex}. The generated
|
|
scanner is ordinary C or C++, and does @emph{not} require @code{m4}.
|
|
|
|
@node Common Patterns, ,M4 Dependency, Appendices
|
|
@section Common Patterns
|
|
@cindex patterns, common
|
|
|
|
This appendix provides examples of common regular expressions you might use
|
|
in your scanner.
|
|
|
|
@menu
|
|
* Numbers::
|
|
* Identifiers::
|
|
* Quoted Constructs::
|
|
* Addresses::
|
|
@end menu
|
|
|
|
|
|
@node Numbers, Identifiers, ,Common Patterns
|
|
@subsection Numbers
|
|
|
|
@table @asis
|
|
|
|
@item C99 decimal constant
|
|
@code{([[:digit:]]@{-@}[0])[[:digit:]]*}
|
|
|
|
@item C99 hexadecimal constant
|
|
@code{0[xX][[:xdigit:]]+}
|
|
|
|
@item C99 octal constant
|
|
@code{0[01234567]*}
|
|
|
|
@item C99 floating point constant
|
|
@verbatim
|
|
dseq      ([[:digit:]]+)
dseq_opt  ([[:digit:]]*)
frac      (({dseq_opt}"."{dseq})|{dseq}".")
exp       ([eE][+-]?{dseq})
exp_opt   ({exp}?)
fsuff     [flFL]
fsuff_opt ({fsuff}?)
hpref     (0[xX])
hdseq     ([[:xdigit:]]+)
hdseq_opt ([[:xdigit:]]*)
hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
bexp      ([pP][+-]?{dseq})
dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

c99_floating_point_constant  ({dfc}|{hfc})
|
|
@end verbatim
|
|
|
|
See C99 section 6.4.4.2 for the gory details.
|
|
|
|
@end table
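
As an illustration of how the integer patterns above might be combined, the
following sketch classifies each integer constant it finds. The printed
labels and the @code{main} function are assumptions made for the example.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}
%%
0[xX][[:xdigit:]]+                { printf("hex %s\n", yytext); }
0[01234567]*                      { printf("octal %s\n", yytext); }
([[:digit:]]{-}[0])[[:digit:]]*   { printf("decimal %s\n", yytext); }
.|\n                              ;
%%
int main(void)
{
    return yylex();
}
@end verbatim
@end example

Since flex prefers the longest match, an input such as @samp{0x1F} is taken by
the hexadecimal rule rather than being split after the leading zero. The
sketch does not check what follows a constant, so an invalid sequence such as
@samp{08} is reported as two separate constants.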
|
|
|
|
@node Identifiers, Quoted Constructs, Numbers, Common Patterns
|
|
@subsection Identifiers
|
|
|
|
@table @asis
|
|
|
|
@item C99 Identifier
|
|
@verbatim
|
|
ucn ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
|
|
nondigit [_[:alpha:]]
|
|
c99_id ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
|
|
@end verbatim
|
|
|
|
Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
|
|
"implementation-defined" characters. In practice, C compilers follow the above pattern, with the
|
|
addition of the @samp{$} character.
|
|
|
|
@item UTF-8 Encoded Unicode Code Point
|
|
@verbatim
|
|
[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
|
|
@end verbatim
|
|
|
|
@end table
|
|
|
|
@node Quoted Constructs, Addresses, Identifiers, Common Patterns
|
|
@subsection Quoted Constructs
|
|
|
|
@table @asis
|
|
@item C99 String Literal
|
|
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}
|
|
|
|
@item C99 Comment
|
|
@code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)}
|
|
|
|
Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief,
|
|
does not include the trailing @samp{\n} character.
|
|
|
|
A better way to scan @samp{/* */} comments is by line, rather than matching
|
|
possibly huge comments all at once. This will allow you to scan comments of
|
|
unlimited length, as long as line breaks appear at sane intervals. This is also
|
|
more efficient when used with automatic line number processing. @xref{option-yylineno}.
|
|
|
|
@verbatim
<INITIAL>{
"/*"      BEGIN(COMMENT);
}
<COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   ;
"*"       ;
\n        ;
}
@end verbatim
|
|
|
|
@end table
|
|
|
|
@node Addresses, ,Quoted Constructs, Common Patterns
|
|
@subsection Addresses
|
|
|
|
@table @asis
|
|
|
|
@item IPv4 Address
|
|
@verbatim
|
|
dec-octet [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
|
|
IPv4address {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
|
|
@end verbatim
|
|
|
|
@item IPv6 Address
|
|
@verbatim
|
|
h16 [0-9A-Fa-f]{1,4}
|
|
ls32 {h16}:{h16}|{IPv4address}
|
|
IPv6address ({h16}:){6}{ls32}|
|
|
::({h16}:){5}{ls32}|
|
|
({h16})?::({h16}:){4}{ls32}|
|
|
(({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
|
|
(({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
|
|
(({h16}:){0,3}{h16})?::{h16}:{ls32}|
|
|
(({h16}:){0,4}{h16})?::{ls32}|
|
|
(({h16}:){0,5}{h16})?::{h16}|
|
|
(({h16}:){0,6}{h16})?::
|
|
@end verbatim
|
|
|
|
See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details.
|
|
Note that you have to fold the definition of @code{IPv6address} into one
|
|
line and that it also matches the ``unspecified address'' ``::''.
|
|
|
|
@item URI
|
|
@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
|
|
|
|
This pattern is nearly useless, since it allows just about any character
|
|
to appear in a URI, including spaces and control characters. See
|
|
@uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details.
|
|
|
|
@end table
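
The IPv4 definitions above could be used in a scanner along the following
lines. This is a minimal sketch; the output format and the @code{main}
function are assumptions made for the example.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}
dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
%%
{IPv4address}   { printf("IPv4: %s\n", yytext); }
.|\n            ;
%%
int main(void)
{
    return yylex();
}
@end verbatim
@end example

The pattern is not anchored at word boundaries, so it can also match an
address-like substring inside a longer run of digits and dots. A real scanner
would probably need additional rules to reject such input.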
|
|
|
|
|
|
@node Indices, , Appendices, Top
|
|
@unnumbered Indices
|
|
|
|
@menu
|
|
* Concept Index::
|
|
* Index of Functions and Macros::
|
|
* Index of Variables::
|
|
* Index of Data Types::
|
|
* Index of Hooks::
|
|
* Index of Scanner Options::
|
|
@end menu
|
|
|
|
@node Concept Index, Index of Functions and Macros, Indices, Indices
|
|
@unnumberedsec Concept Index
|
|
|
|
@printindex cp
|
|
|
|
@node Index of Functions and Macros, Index of Variables, Concept Index, Indices
|
|
@unnumberedsec Index of Functions and Macros
|
|
|
|
This is an index of functions and preprocessor macros that look like functions.
|
|
For macros that expand to variables or constants, see @ref{Index of Variables}.
|
|
|
|
@printindex fn
|
|
|
|
@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices
|
|
@unnumberedsec Index of Variables
|
|
|
|
This is an index of variables, constants, and preprocessor macros
|
|
that expand to variables or constants.
|
|
|
|
@printindex vr
|
|
|
|
@node Index of Data Types, Index of Hooks, Index of Variables, Indices
|
|
@unnumberedsec Index of Data Types
|
|
@printindex tp
|
|
|
|
@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices
|
|
@unnumberedsec Index of Hooks
|
|
|
|
This is an index of "hooks" that the user may define. These hooks typically correspond
|
|
to specific locations in the generated scanner, and may be used to insert arbitrary code.
|
|
|
|
@printindex hk
|
|
|
|
@node Index of Scanner Options, , Index of Hooks, Indices
|
|
@unnumberedsec Index of Scanner Options
|
|
|
|
@printindex op
|
|
|
|
@c A vim script to name the faq entries. delete this when faqs are no longer
|
|
@c named "unnamed-faq-XXX".
|
|
@c
|
|
@c fu! Faq2 () range abort
|
|
@c let @r=input("Rename to: ")
|
|
@c exe "%s/" . @w . "/" . @r . "/g"
|
|
@c normal 'f
|
|
@c endf
|
|
@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr>
|
|
|
|
@bye
|