REGEX/CLAWK/LEXER Packages

If you just came here for the code, you can find it here. This is a tar file containing the REGEX, CLAWK, and LEXER packages, as well as some test code.

Common Lisp is a wonderful language for processing complicated data structures, but let's face it -- while it's great at symbols and lists it's pretty sucky at plain old text.

Partly because of this shortcoming, partly because I had a regex engine lying around, and partly because my beloved Lisp Machine doesn't come with AWK or Perl, I've written a couple of packages to help alleviate the situation.






REGEX package

The regex engine is a pretty full-featured matcher, and thus is useful by itself. It was originally written as a prototype for a C++ matcher, though it has since diverged greatly.

The regex compiler supports the following pattern syntax:

There are a variety of functions in the REGEX package that allow the programmer to adjust the allowable regular expression syntax:

Parenthesized expressions within the pattern are considered a register pattern, and will be recorded for use after the match. There is an implicit set of parentheses around the entire expression, so the bounds of the matched text itself will always occupy register 0.

Extensions that will be coming soon include:

  1. I am working on a second backend for the regex compiler that generates an even faster matcher (~4-20x faster on Symbolics, ~ 2x faster on LWW). The compilation process itself is substantially slower. I've got some more work to do to get the speed up even further on Lispworks, although the current system is already much, much faster than GNU Regex.
  2. Optionally allowing a negated regex pattern using the <pattern> '^' syntax. This also subsumes the negated character class in that [^...] === [...]^.
  3. Faster scans by using a possible-prefix set. This isn't real high priority at the moment since matching is plenty fast already :-)
  4. Prefix and postfix context patterns ala LEX.

Regex has been recently enhanced. Everything from the parser back has been completely rewritten. The regex system now includes a bunch of functions for manipulating regex parse trees directly, a multipass optimizer and code generator, and a new matching engine.

The new regex system does a better job of optimizing a wider range of patterns. It also supports an extension that allows you to provide an "accept" function to the match-str function. This acceptfn takes the start and end position as parameters, and can find the string itself in the special variable *STR* and the registers in the special variable *regs*. It returns either nil to force the matcher to backtrack, or a non-nil value which will be returned as the success code for the match.

An additional change is that register patterns within quantified patterns now return the leftmost occurrence in the source string. There is a flag to force the more usual rightmost match, but this will reduce the applicability of many critical optimizations.

The latest version of regex supports the Perl \d, \D, \w, \W, \s, and \S metasequences, as well as the egrep \< start-of-word and \> end-of-word metasequences.

Get the code for the latest regex here.






CLAWK Package

This package implements nearly all the features provided by the Unix AWK language, albeit with a pretty lisp-y flavor. In addition, it provides a large set of macros for doing more complicated processing beyond that provided by AWK.

To give you an idea of the flavor of CLAWK programming, here's an example from Ch 1 of The Awk Programming Language that calculates the number of employees, total pay for all employees, and average pay.

        { pay = pay + $2 * $3 }
    END { print NR, "employees"
          print "total pay is", pay
          print "average pay is", pay/NR
        }

The book gives the text file "emp.data" as example input.

Beth    4.00    0
Dan     3.75    0
Kathy   4.00    10
Mark    5.00    20
Mary    5.50    22
Suzie   4.25    18

When the awk program (called say, SAMPLE.AWK) is executed with the command

$ awk -f sample.awk emp.data

it produces the following output:

6 employees
total pay is 337.5
average pay is 56.25

Here's a straightforward translation of SAMPLE.AWK using the CLAWK package:

    (defawk sample (&aux (pay 0))
      (t   (incf pay ($* $2 $3)))
      (END ($print *NR* "employees")
           ($print "total pay is" pay)
           ($print "average pay is " (/ pay *NR*))) )

When this function is executed with (sample "emp.data"), it produces the same output:

6 employees 
total pay is 337.5 
average pay is 56.25 
NIL

(the NIL is the return value from the SAMPLE function).

That's nearly as concise as the original AWK. Here's a version that doesn't use the DEFAWK macro, and tries to be a bit more Lispish.

    (defun sample (filename &aux (pay 0))
      (do-file-fields (filename (name payrate hrsworked))
        (declare (ignore name))
        (incf pay (* (num payrate) (num hrsworked))))
      (format t "~%~D employees" *NR*)
      (format t "~%total pay is ~F" pay)
      (format t "~%average pay is ~F" (/ pay *NR*)) )

... which prints out the same thing as above.

Here's another example from The AWK Programming Language that shows off the regular expressions a little better. This program performs some basic validity checks on a Unix password file.

    BEGIN {
        FS = ";" }
    NF != 7 {
        printf("line %d, does not have 7 fields: %s\n", NR, $0)}
    $1 ~ /[^A-Za-z0-9]/ {
        printf("line %d, nonalphanumeric user id: %s\n", NR, $0)}
    $2 == "" {
        printf("line %d, no password: %s\n", NR, $0)}
    $3 ~ /[^0-9]/ {
        printf("line %d, nonnumeric user id: %s\n", NR, $0)}
    $4 ~ /[^0-9]/ {
        printf("line %d, nonnumeric group id: %s\n", NR, $0)}
    $6 !~ /^\// {
        printf("line %d, invalid login directory: %s\n", NR, $0)}

And this is the same program using CLAWK:

    (defawk checkpass ()
      (BEGIN
       (setq *FS* ";"))
      ((/= *NF* 7)
       (format t "~%line ~D, does not have 7 fields: ~S" *NR* $0))
      ((~ $1 #/[^A-Za-z0-9]/)
       (format t "~%line ~D, nonalphanumeric user id: ~S" *NR* $0))
      (($== $2 "")
       (format t "~%line ~D, no password: ~S" *NR* $0))
      ((~ $3 #/[^0-9]/)
       (format t "~%line ~D, nonnumeric user id: ~S" *NR* $0))
      ((~ $4 #/[^0-9]/)
       (format t "~%line ~D, nonnumeric group id: ~S" *NR* $0))
      ((!~ $6 #/^\//)
       (format t "~%line ~D, invalid login directory: ~S" *NR* $0)) )

Which you must admit is awfully close to the original AWK. Since this uses the #/ readmacro you must evaluate (clawk:install-regex-syntax) before evaluating this sample.

CLAWK summary

The INSTALL-REGEX-SYNTAX function sets up a readmacro on #\/ that parses to an unescaped #\/, and interns the resulting string in the current package. This (a) gives a nice AWK-ish syntax (b) results in an object that can be printed readably and interacts well in the listener even without Dynamic Windows, and (c) gives CLAWK a convenient place to hook the compiled matcher and some associated information, speeding things up nicely. You can still specify these things using the |...| syntax, but it's not quite as nice, since #\| is a common character in regex patterns.

The INSTALL-CMD-SYNTAX function sets up a readmacro on #` that parses to an unescaped ` character, and expands into a call to CLAWK::CALL-SYSTEM-CATCHING-OUTPUT. This function sends the command to the shell and returns an input stream containing the output of the command. This stream can then be used by CLAWK:FOR-STREAM-LINES or any of the CL stream functions. The readmacro is defined for all non-Symbolics systems, although currently only LispWorks is supported by CALL-SYSTEM-CATCHING-OUTPUT.

The various special variables *FS*, *NR*, *NF*, *RSTART*, *RLENGTH*, etc are fairly obvious if you've ever used AWK, as are the SUB, GSUB, SPLIT, INDEX, MATCH, SUBSTR, and ~ and !~ functions. $SUB, $GSUB, $SPLIT, $INDEX, $MATCH, and $SUBSTR do the same, but they attempt to coerce the various parameters to the right type (so you can pass in a string for a parameter that expects a number).

WITH-PATTERNS is a macro that cleans up the syntax for using dynamically-generated regex patterns. A future extension is for this to use the fast regex compiler instead of the closure-based regex compiler when the compiler is available and the pattern is a string literal.

WITH-FIELDS is a macro for doing line-splitting into variables. WITH-FIELDS takes a optional field list (in destructuring-bind syntax) and an optional string (default is the current line -- see DO-LINES) and field-separator (default is *FS*). If you don't give a set of variables to use, it will use the $ variables. It splits the line using SPLIT, and executes the body forms in an environment containing the variables. The $n variables are locally bound special, so they will revert on exit.

WITH-SUBMATCHES is a macro for processing register submatches. It takes a list of variables and a set of body forms, and binds the variables to the strings corresponding to the register submatches (or nil if the register didn't match), and evaluates the body forms in this environment. This is particularly handy for use in the consequent clauses of match-case.

MATCH-CASE is similar to CASE except that the value must be a string, and the cases are strings or regex symbols. Handy for simulating the implicit outer match structure of an AWK program, without being limited to just the toplevel. I tend to find that my serious AWK programs pretty quickly devolve into a BEGIN clause that calls a bunch of functions to do the real work, and this macro (along with the next few) really make this cleaner; allowing me to reuse that nifty auto-looping / splitting / matching mechanism wherever I need it.

MATCH-WHEN takes a set of forms similar to the toplevel CLAWK forms (with BEGIN, END, and other pattern clauses), but unlike the toplevel it does no looping. It executes all the BEGIN clauses first, then conditionally executes each of the pattern clauses, then executes all the END clauses. It winds up looking a lot like COND, except that it recognizes several special flavors of test expression, and is basically equivalent to

    (progn (when test1 . consequent1)
           (when test2 . consequent2)
           ...)

As with real AWK, the default action is to print the current line.

DO-FILE-LINES is a macro that takes a filename and a set of body forms. It opens the file in read mode, and loops over the lines in the file binding the line variable to the current line and executing the body forms. It doesn't do any line splitting, though it maintain the *NR*, *FNR*, and $0 variables. Executing (NEXT) within the body will restart the body at the next input line.

DO-STREAM-LINES is a closely related version that takes a stream (or t for the standard input stream).

DO-FILE-FIELDS is a macro that takes a filename, an optional field-list, and a set of body forms. It opens the file in read mode, and loops over the lines in the file splitting them into the specified field variables (or the $n variables if no field list was given), and executes the body forms in this new environment. Because of the splitting overhead, it's not as snappy as FOR-FILE-LINES. Executing (NEXT) within the body will restart the processing at the next input line.

DO-STREAM-FIELDS is a closely related version that takes a stream (or t for the standard input stream).

WHEN-STREAM-FIELDS and WHEN-FILE-FIELDS implement most of the CLAWK toplevel as a reusable special form. It takes an input stream (or file) and an optional field list, and a set of MATCH-WHEN clauses. The BEGIN clauses will be executed before the file is opened, the pattern clauses will be executed as with MATCH-WHEN, and the END clauses will executed after the file is closed, and the body. One limitation is it doesn't currently support range patterns (it's pretty simple to add, I just haven't gotten around to it).

DEFAWK is the highest-level macro, and one that closely mimics the awk toplevel. It takes a name, a parameter list consisting of &key and &aux parameters, and a set of MATCH-WHEN clauses. It binds the parameter list to the variable ARGS, and evaluates the BEGIN clauses. The BEGIN clauses are free to modify ARGS. Then it loops through the (possibly modified) ARGS, executing the pattern clauses for each stream and pathname in the list. It handles docstrings correctly.

The handling of the parameter list is perhaps the oddest thing about the DEFAWK macro. It is intended to closely match the behavior of a command-line AWK. A pleasant difference is that keyword parameters are automatically parsed out, instead of requiring you to do it manually like in real AWK.

The $+ , $-, $*, $/, $REM, $EXPT, $++, #=, $<, $>, $<=, $>=, $/=, $MIN, $MAX, $ZEROP, $LENGTH, $PRINT functions can take numbers and strings and do the AWK thing, coercing the arguments back and forth as necessary. $++ is the concatenation operator (in AWK this is indicated by space, which clearly doesn't work in the Lisp world). $LENGTH is overloaded on associative arrays (i.e. hashtables) to return the HASH-TABLE-COUNT.

The generic functions INT, STR, and NUM handle the common coercions. AWK does provide an INT function, but string coercions are done by concatenating the value with the empty string, and numeric coercions are done by adding 0 to the value. While these kludges also work in CLAWK, the NUM and STR functions are to be preferred.

The $ARRAY, $AREF, $FOR $IN and $DELETE functions (and macros) implement AWK-like associative arrays (including the amusing SUBSEP behavior for multiple indices), except that unlike AWK there's nothing to keep you from nesting these "arrays".

$0, $1, ..., $20 are symbol-macros that access the fields of the current line. $#0, $#1, ..., $#20 do also, but they attempt to interpret the value as a number.

%0, %1, ..., %20 are symbol-macros that index into the recent submatches. %#0, %#1, ..., %#20 do also, but they attempt to interpret the submatch value as a number.

This system also defines a clawk-user package.

Get the code for CLAWK here.






LEXER package

The LEXER package implements a lexical-analyzer-generator called DEFLEXER, which is built on top of both REGEX and CLAWK. Many of the optimizations in the recent rewrite of the regex engine went into optimizing the sorts of patterns generated by DEFLEX.

The default lexer doesn't implement full greediness. If you have a rule for ints followed by a rule for floats, the int rule will match on the part before the decimal before the float rule gets a change to look at it. You can fix this by specifying :flex-compatible as the first rule. This gives all patterns a chance to examine the text and takes the one that matches the longest string (first pattern wins in case of a tie). The down side of this option is that it slows down the analyser. If you can solve the issue by reordering your rules that's the way to do it.

I'm currently writing an AWK->CLAWK translator using this as the lexer, and it's working fine. As far as I can tell, the DEFLEXER-generated lexing functions should be fast enough for production use.

Currently, the LEX/FLEX/BISON feature of switching productions on and off using state variables is not supported, but it's a pretty simple feature to add. If you're using LEXER and discover you need this feature, let me know.

It also doesn't yet support prefix and postfix context patterns. This isn't quite so trivial to add, but it's planned for a future release of regex, so LEXER will be getting it someday.

Anyway, Here's a simple DEFLEXER example:

  (deflexer test-lexer
    ("[0-9]+([.][0-9]+([Ee][0-9]+)?)"
      (return (values 'flt (num %0))))
    ("[0-9]+"
      (return (values 'int (int %0))))
    ("[:alpha:][:alnum:]*"
      (return (values 'name %0)))
    ("[:space:]+") )

  > (setq *lex* (test-lexer "1.0 12 fred 10.23e45"))
  <closure>
 
  > (funcall *lex*)
  FLT
  1.0
 
  > (funcall *lex*)
  INT
  12
 
  > (funcall *lex*)
  NAME
  "fred"

  > (funcall *lex*)
  FLT
  1.0229999999999997E46

  > (funcall *lex*)
  NIL
  NIL

You can also write this lexer using the :flex-compatible option, in which case you can write the int and flt rules in any order.

(deflexer test-lexer
  :flex-compatible
  ("[0-9]+"
    (return (values 'int (int %0))))
  ("[0-9]+([.][0-9]+([Ee][0-9]+)?)"
    (return (values 'flt (num %0))))
  ("[:space:]+")
 )

Get the code for LEXER here.


Michael Parker
1