REGEX/CLAWK/LEXER Packages
If you just came here for the code, you can find it
here. This is a tar file
containing the REGEX, CLAWK, and LEXER packages, as well as
some test code.
Common Lisp is a wonderful language for processing
complicated data structures, but let's face it -- while it's
great at symbols and lists it's pretty sucky at plain old
text.
Partly because of this shortcoming, partly because I had
a regex engine lying around, and partly because my beloved
Lisp Machine doesn't come with AWK or Perl, I've written a
couple of packages to help alleviate the situation.
- REGEX contains a full-featured
regular expression compiler and matching engine. If
you're one of those who believe that Lisp is slow, note
that this regex engine is roughly 5x-20x times faster than
the GNU regex matcher (which is obviously written in
C).
- CLAWK contains most of the
interesting bits of AWK (and even some of the
not-so-interesting bits that are here anyway to ease
porting). Most of the features that make PERL a win over
AWK are already present in CL itself, and so aren't
included here.
- LEXER contains a lexer-generator
macro called DEFLEXER. It produces lexers that are
compatible with the parsers produced with Xanalys's
DEFPARSER macro.
REGEX package
The regex engine is a pretty full-featured matcher, and
thus is useful by itself. It was originally written as a
prototype for a C++ matcher, though it has since diverged
greatly.
The regex compiler supports the following pattern syntax:
- ^ matches the start of a string.
- $ matches the end of a string.
- [...] denotes a character class.
- [^...] denotes a negated character class.
- [:...:] denotes a special character class.
- [:alpha:] == [A-Za-z]
- [:upper:] == [A-Z]
- [:lower:] == [a-z]
- [:digit:] == [0-9]
- [:alnum:] == [A-Za-z0-9]
- [:xdigit:] == [A-Fa-f0-9]
- [:space:] == whitespace
- [:punct:] == punctuation marks
- [:graph:] == printable characters other than space
- [:cntrl:] == control characters
- [:word:] == wordlike characters
- [^:...:] denotes a negated special character class.
- . matches any character.
- (...) delimits a regex subexpression. Also denotes a
register pattern.
- (?...) denotes a regex subexpression that will
not be captured in a register.
- (?=...) denotes a regex
subexpression that will be used as a forward lookahead.
If the subexpression matches, then the rest of the match
will continue as if the lookahead match had not occurred
(i.e. it does not consume the candidate string). It will
not be captured in a register, though it can contain
subexpressions that may be captured.
- (?!...) denotes a regex subexpression that will be used
as a negative forward lookahead (the match will continue only
if the lookahead failed to match). It will not be
captured in a register, though it can contain
subexpressions that may be captured.
- * denotes the kleene closure of the previous regex
subexpression.
- + denotes the positive closure of the previous regex
subexpression.
- *? denotes the non-greedy kleene closure of the
previous regex subexpression.
- +? denotes the non-greedy positive closure of the
previous regex subexpression.
- ? denotes the greedy match of 0 or 1 occurrences of
the previous regex subexpression.
- ?? denotes the non-greedy match of 0 or 1 occurrences
of the previous regex subexpression.
- \nn denotes a back-match against the contents of a
previously-matched register.
- {nn,mm} denotes a bounded repetition.
- {nn,mm}? denotes a non-greedy bounded repetition.
- \n, \t, \r have their normal meanings.
- \d matches any decimal character,
\D matches any nondecimal character.
- \w matches any wordlike
character, \W matches any nonwordlike character.
- \s matches any whitespace
character, \S matches any nonspace character.
- \< matches at the start of a
word. \> matches at the end of a word.
- \<char> that character (escapes an otherwise
special meaning).
- Special characters lose their specialness when
escaped. There is a flag to control this.
- All other characters are matched literally.
There are a variety of functions in the REGEX package
that allow the programmer to adjust the allowable regular
expression syntax:
- The function ESCAPE-SPECIAL-CHARS allows you to change
whether the meta-characters have their magic meaning when
escaped or unescaped. The default behavior (per AWK
syntax) is that special chars are unescaped.
- The function ALLOW-BACKMATCH allows you to change
whether or not the \nn syntax is allowed. By default it
is allowed.
- The function ALLOW-RANGEMATCH allows you to change
whether or not the the {nn,mm} bounded repetition syntax
is allowed. By default it is allowed.
- The function ALLOW-NONGREEDY-QUANTIFIERS allows you to
change whether or not the *?, +?, ??, and {nn,mm}?
quantifiers are recognized. By default they are
allowed.
- The function ALLOW-NONREGISTER-GROUPS allows you to
change whether or not the (?...) syntax is recognized. By
default it is allowed.
- The function DOT-MATCHES-NEWLINE allows you to change
whether '.' in a pattern matches the newline character.
This is false by default.
Parenthesized expressions within the pattern are
considered a register pattern, and will be recorded for use
after the match. There is an implicit set of parentheses
around the entire expression, so the bounds of the matched
text itself will always occupy register 0.
Extensions that will be coming soon include:
- I am working on a second backend for the regex
compiler that generates an even faster matcher (~4-20x
faster on Symbolics, ~ 2x faster on LWW). The compilation
process itself is substantially slower. I've got some
more work to do to get the speed up even further on
Lispworks, although the current system is already much,
much faster than GNU Regex.
- Optionally allowing a negated regex pattern using the
<pattern> '^' syntax. This also subsumes the
negated character class in that
[^...] ===
[...]^.
- Faster scans by using a possible-prefix set. This
isn't real high priority at the moment since matching is
plenty fast already :-)
- Prefix and postfix context patterns ala LEX.
Regex has been recently enhanced. Everything from the
parser back has been completely rewritten. The regex system
now includes a bunch of functions for manipulating regex
parse trees directly, a multipass optimizer and code
generator, and a new matching engine.
The new regex system does a better job of optimizing a
wider range of patterns. It also supports an extension that
allows you to provide an "accept" function to the match-str
function. This acceptfn takes the start and end position as
parameters, and can find the string itself in the special
variable *STR* and the registers in the special variable
*regs*. It returns either nil to force the matcher to
backtrack, or a non-nil value which will be returned as
the success code for the match.
An additional change is that register patterns within
quantified patterns now return the leftmost occurrence in
the source string. There is a flag to force the more usual
rightmost match, but this will reduce the applicability of
many critical optimizations.
The latest version of regex supports the Perl \d, \D, \w,
\W, \s, and \S metasequences, as well as the egrep \<
start-of-word and \> end-of-word metasequences.
Get the code for the latest regex
here.
CLAWK Package
This package implements nearly all the features provided
by the Unix AWK language, albeit with a pretty lisp-y
flavor. In addition, it provides a large set of macros for
doing more complicated processing beyond that provided by
AWK.
To give you an idea of the flavor of CLAWK programming,
here's an example from Ch 1 of The Awk Programming
Language that calculates the number of employees,
total pay for all employees, and average pay.
{ pay = pay + $2 * $3 }
END { print NR, "employees"
print "total pay is", pay
print "average pay is", pay/NR
}
The book gives the text file "emp.data" as example
input.
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Suzie 4.25 18
When the awk program (called say, SAMPLE.AWK) is executed
with the command
$ awk -f sample.awk
emp.data
it produces the following output:
6 employees
total pay is 337.5
average pay is 56.25
Here's a straightforward translation of SAMPLE.AWK using
the CLAWK package:
(defawk sample (&aux (pay 0))
(t (incf pay ($* $2 $3)))
(END ($print *NR* "employees")
($print "total pay is" pay)
($print "average pay is " (/ pay *NR*))) )
When this function is executed with (sample
"emp.data"), it produces the same output:
6 employees
total pay is 337.5
average pay is 56.25
NIL
(the NIL is the return value from the SAMPLE
function).
That's nearly as concise as the original AWK. Here's a
version that doesn't use the DEFAWK macro, and tries to be a
bit more Lispish.
(defun sample (filename &aux (pay 0))
(do-file-fields (filename (name payrate hrsworked))
(declare (ignore name))
(incf pay (* (num payrate) (num hrsworked))))
(format t "~%~D employees" *NR*)
(format t "~%total pay is ~F" pay)
(format t "~%average pay is ~F" (/ pay *NR*)) )
... which prints out the same thing as above.
Here's another example from The AWK Programming
Language that shows off the regular expressions a
little better. This program performs some basic validity
checks on a Unix password file.
BEGIN {
FS = ";" }
NF != 7 {
printf("line %d, does not have 7 fields: %s\n", NR, $0)}
$1 ~ /[^A-Za-z0-9]/ {
printf("line %d, nonalphanumeric user id: %s\n", NR, $0)}
$2 == "" {
printf("line %d, no password: %s\n", NR, $0)}
$3 ~ /[^0-9]/ {
printf("line %d, nonnumeric user id: %s\n", NR, $0)}
$4 ~ /[^0-9]/ {
printf("line %d, nonnumeric group id: %s\n", NR, $0)}
$6 !~ /^\// {
printf("line %d, invalid login directory: %s\n", NR, $0)}
And this is the same program using CLAWK:
(defawk checkpass ()
(BEGIN
(setq *FS* ";"))
((/= *NF* 7)
(format t "~%line ~D, does not have 7 fields: ~S" *NR* $0))
((~ $1 #/[^A-Za-z0-9]/)
(format t "~%line ~D, nonalphanumeric user id: ~S" *NR* $0))
(($== $2 "")
(format t "~%line ~D, no password: ~S" *NR* $0))
((~ $3 #/[^0-9]/)
(format t "~%line ~D, nonnumeric user id: ~S" *NR* $0))
((~ $4 #/[^0-9]/)
(format t "~%line ~D, nonnumeric group id: ~S" *NR* $0))
((!~ $6 #/^\//)
(format t "~%line ~D, invalid login directory: ~S" *NR* $0)) )
Which you must admit is awfully close to the original
AWK. Since this uses the #/ readmacro you must evaluate
(clawk:install-regex-syntax) before evaluating this
sample.
CLAWK summary
The INSTALL-REGEX-SYNTAX function sets up a
readmacro on #\/ that parses to an unescaped #\/, and
interns the resulting string in the current package. This
(a) gives a nice AWK-ish syntax (b) results in an object
that can be printed readably and interacts well in the
listener even without Dynamic Windows, and (c) gives CLAWK a
convenient place to hook the compiled matcher and some
associated information, speeding things up nicely. You can
still specify these things using the |...| syntax, but it's
not quite as nice, since #\| is a common character in regex
patterns.
The INSTALL-CMD-SYNTAX function sets up a
readmacro on #` that parses to an unescaped ` character, and
expands into a call to
CLAWK::CALL-SYSTEM-CATCHING-OUTPUT. This function
sends the command to the shell and returns an input stream
containing the output of the command. This stream can
then be used by CLAWK:FOR-STREAM-LINES or any of
the CL stream functions. The readmacro is defined for all
non-Symbolics systems, although currently only LispWorks is
supported by CALL-SYSTEM-CATCHING-OUTPUT.
The various special variables *FS*, *NR*,
*NF*, *RSTART*, *RLENGTH*, etc
are fairly obvious if you've ever used AWK, as are the
SUB, GSUB, SPLIT, INDEX,
MATCH, SUBSTR, and ~ and
!~ functions.
$SUB, $GSUB, $SPLIT, $INDEX,
$MATCH, and $SUBSTR do the same, but they
attempt to coerce the various parameters to the right type
(so you can pass in a string for a parameter that expects a
number).
WITH-PATTERNS is a macro that cleans up the
syntax for using dynamically-generated regex patterns. A
future extension is for this to use the fast regex compiler
instead of the closure-based regex compiler when the
compiler is available and the pattern is a string literal.
WITH-FIELDS is a macro for doing line-splitting
into variables. WITH-FIELDS takes a optional field
list (in destructuring-bind syntax) and an optional string
(default is the current line -- see DO-LINES) and
field-separator (default is *FS*). If you don't give a set
of variables to use, it will use the $ variables. It splits
the line using SPLIT, and executes the body forms
in an environment containing the variables. The $n
variables are locally bound special, so they will revert on
exit.
WITH-SUBMATCHES is a macro for processing
register submatches. It takes a list of variables and a set
of body forms, and binds the variables to the strings
corresponding to the register submatches (or nil if the
register didn't match), and evaluates the body forms in this
environment. This is particularly handy for use in the
consequent clauses of match-case.
MATCH-CASE is similar to CASE except
that the value must be a string, and the cases are strings
or regex symbols. Handy for simulating the implicit outer
match structure of an AWK program, without being limited to
just the toplevel. I tend to find that my serious AWK
programs pretty quickly devolve into a BEGIN clause that
calls a bunch of functions to do the real work, and this
macro (along with the next few) really make this cleaner;
allowing me to reuse that nifty auto-looping / splitting /
matching mechanism wherever I need it.
MATCH-WHEN takes a set of forms similar to the
toplevel CLAWK forms (with BEGIN, END, and other pattern
clauses), but unlike the toplevel it does no looping. It
executes all the BEGIN clauses first, then conditionally
executes each of the pattern clauses, then executes all the
END clauses. It winds up looking a lot like COND,
except that it recognizes several special flavors of test
expression, and is basically equivalent to
(progn (when test1 . consequent1)
(when test2 . consequent2)
...)
As with real AWK, the default action is to print the
current line.
DO-FILE-LINES is a macro that takes a filename
and a set of body forms. It opens the file in read mode,
and loops over the lines in the file binding the line
variable to the current line and executing the body forms.
It doesn't do any line splitting, though it maintain the
*NR*, *FNR*, and $0 variables. Executing (NEXT)
within the body will restart the body at the next input
line.
DO-STREAM-LINES is a closely related version
that takes a stream (or t for the standard input
stream).
DO-FILE-FIELDS is a macro that takes a filename,
an optional field-list, and a set of body forms. It opens
the file in read mode, and loops over the lines in the file
splitting them into the specified field variables (or the $n
variables if no field list was given), and executes the body
forms in this new environment. Because of the splitting
overhead, it's not as snappy as FOR-FILE-LINES.
Executing (NEXT) within the body will restart the
processing at the next input line.
DO-STREAM-FIELDS is a closely related version
that takes a stream (or t for the standard input
stream).
WHEN-STREAM-FIELDS and WHEN-FILE-FIELDS
implement most of the CLAWK toplevel as a reusable special
form. It takes an input stream (or file) and an optional
field list, and a set of MATCH-WHEN clauses. The
BEGIN clauses will be executed before the file is opened,
the pattern clauses will be executed as with
MATCH-WHEN, and the END clauses will executed after
the file is closed, and the body. One limitation is it
doesn't currently support range patterns (it's pretty simple
to add, I just haven't gotten around to it).
DEFAWK is the highest-level macro, and one that
closely mimics the awk toplevel. It takes a name, a
parameter list consisting of &key and &aux parameters, and a
set of MATCH-WHEN clauses. It binds the parameter
list to the variable ARGS, and evaluates the BEGIN clauses.
The BEGIN clauses are free to modify ARGS. Then it loops
through the (possibly modified) ARGS, executing the pattern
clauses for each stream and pathname in the list. It
handles docstrings correctly.
The handling of the parameter list is perhaps the oddest
thing about the DEFAWK macro. It is intended to
closely match the behavior of a command-line AWK. A
pleasant difference is that keyword parameters are
automatically parsed out, instead of requiring you to do it
manually like in real AWK.
The $+ , $-, $*, $/,
$REM, $EXPT, $++, #=,
$<, $>, $<=, $>=,
$/=, $MIN, $MAX, $ZEROP,
$LENGTH, $PRINT functions can take numbers
and strings and do the AWK thing, coercing the arguments
back and forth as necessary. $++ is the
concatenation operator (in AWK this is indicated by space,
which clearly doesn't work in the Lisp world).
$LENGTH is overloaded on associative arrays
(i.e. hashtables) to return the
HASH-TABLE-COUNT.
The generic functions INT, STR, and
NUM handle the common coercions. AWK does provide
an INT function, but string coercions are done by
concatenating the value with the empty string, and numeric
coercions are done by adding 0 to the value. While these
kludges also work in CLAWK, the NUM and STR
functions are to be preferred.
The $ARRAY, $AREF, $FOR
$IN and $DELETE functions (and macros)
implement AWK-like associative arrays (including the amusing
SUBSEP behavior for multiple indices), except that unlike
AWK there's nothing to keep you from nesting these
"arrays".
$0, $1, ..., $20 are
symbol-macros that access the fields of the current line.
$#0, $#1, ..., $#20 do also,
but they attempt to interpret the value as a number.
%0, %1, ..., %20 are
symbol-macros that index into the recent submatches.
%#0, %#1, ..., %#20 do also,
but they attempt to interpret the submatch value as a
number.
This system also defines a clawk-user package.
Get the code for CLAWK here.
LEXER package
The LEXER package implements a lexical-analyzer-generator
called DEFLEXER, which is built on top of both
REGEX and CLAWK. Many of the optimizations in the recent
rewrite of the regex engine went into optimizing the sorts
of patterns generated by DEFLEX.
The default lexer doesn't implement full greediness. If you have a rule for ints followed by a rule for floats, the int rule will match on the part before the decimal before the float rule gets a change to look at it. You can fix this by specifying :flex-compatible as the first rule. This gives all patterns a chance to examine the text and takes the one that matches the longest string (first pattern wins in case of a tie). The down side of this option is that it slows down the analyser. If you can solve the issue by reordering your rules that's the way to do it.
I'm currently writing an AWK->CLAWK translator using this
as the lexer, and it's working fine. As far as I can tell,
the DEFLEXER-generated lexing functions should be fast
enough for production use.
Currently, the LEX/FLEX/BISON feature of switching
productions on and off using state variables is not
supported, but it's a pretty simple feature to add. If
you're using LEXER and discover you need this feature,
let me know.
It also doesn't yet support prefix and postfix context
patterns. This isn't quite so trivial to add, but it's
planned for a future release of regex, so LEXER will be
getting it someday.
Anyway, Here's a simple DEFLEXER example:
(deflexer test-lexer
("[0-9]+([.][0-9]+([Ee][0-9]+)?)"
(return (values 'flt (num %0))))
("[0-9]+"
(return (values 'int (int %0))))
("[:alpha:][:alnum:]*"
(return (values 'name %0)))
("[:space:]+") )
> (setq *lex* (test-lexer "1.0 12 fred 10.23e45"))
<closure>
> (funcall *lex*)
FLT
1.0
> (funcall *lex*)
INT
12
> (funcall *lex*)
NAME
"fred"
> (funcall *lex*)
FLT
1.0229999999999997E46
> (funcall *lex*)
NIL
NIL
You can also write this lexer using the :flex-compatible option, in which case you can write the int and flt rules in any order.
(deflexer test-lexer
:flex-compatible
("[0-9]+"
(return (values 'int (int %0))))
("[0-9]+([.][0-9]+([Ee][0-9]+)?)"
(return (values 'flt (num %0))))
("[:space:]+")
)
Get the code for LEXER here.