A Java Cross-Reference Tool

Introduction

Have you ever wanted to build a tool that examined or processed Java files? For example, you might be interested in a tool that enforced a coding style or one that automatically augmented programs with debugging code snippets (assertions). Building a real parser for Java with a complete symbol table is nontrivial, and most programmers have shied away from such a task, instead relying on partial solutions written in Perl.

For this final article in the parsing series, we built a Java cross-referencing tool that accumulates complete symbol table information using a Java parser generated by the ANTLR parser generator. The previous article in this series discussed the generation of "tag files" needed to build a code browser and made use of an older version of ANTLR that generated parsers implemented in C++. The tag file generator was almost exclusively a Java parser (no symbol table manager) and, hence, our discussion concentrated on the Java grammar. In this article, we focus on the design of a Java symbol table manager and its interaction with a Java parser. We also make a small detour to describe the latest version of ANTLR that generates parsers implemented in Java.

What is a cross-reference tool?

A Java program consists of definitions and references. When you write a program, you define constructs like classes, methods and variables. These definitions include references to other Java constructs. Classes may refer to another class as their superclass; variables refer to a class or primitive as their type; statements may refer to methods to delegate processing.

A cross-reference tool describes these relationships by parsing one or more Java source files and determining which constructs are actually being referred to. These relationships are presented in a report to describe which constructs refer to other constructs and where a given construct is used.

This article discusses some of the key concepts involved in building a cross reference tool. We start by discussing the functionality that is implemented and work through some of the design issues encountered while building the tool.

Functionality

Before writing code, we should decide what the tool will do:

  • Support full Java 1.1 syntax (including inner classes).
  • Accept (via the command line) a list of directories to search (and that search is recursive).
  • Parse each Java source file in the specified directories for definitions and references. Note that Java source files are recognized as any files that have ".java" as a suffix.
  • Create a symbol table that contains a list of all definitions in the source files and references to those definitions.
  • Generate a report of all defined symbols and where those symbols are referenced.
  • Ignore compiled Java bytecode files (class files).

Because this tool only parses Java source, some definitions will be unavailable. For example, if the source references java.awt.Color but java/awt/Color.java is not parsed during this run of the tool, the tool cannot properly resolve the reference. This is a bit of a problem because we might not be able to properly resolve method invocations. For example:

java.util.Vector v = new java.util.Vector();
hashTable.put(java.awt.Color.blue);

Suppose that Hashtable had two different "put" methods: one takes an Object, the other takes a Component. (It doesn't, but suspend disbelief for a moment.) We cannot determine which is the correct fit because we do not know the superclass of Color. (For that matter, we don't even know that Color.blue is a Color...)

Therefore, we settled on a slightly weaker method resolution algorithm. Assuming the Java source is valid, chances are pretty good that we can use the number of parameters to find the correct method. Obviously this won't always work, but most methods are not overloaded, and a good number of those that are differ in their number of parameters.

Of course there are many more things that this tool could do and the interested reader can look at Possible Extensions for some ideas.

Sample Input and Output

The following is a simple java class (that does nothing) to demonstrate the cross-reference tool's output. When run with the following input:

package simple;

import java.awt.Frame;

public class Scrubby extends Frame {
    int y;

    public Scrubby() {
        f(15);
    }

    int f(int x) {
        return x+y;
    }
}

the cross-reference tool generates the following report:

e:/test/ # java JavaXref simple
Parsing...
   e:\test\simple\Scrubby.java
Resolving types...
resolving: simple.Scrubby
Writing report...

Cross Reference Report
======================

Package: simple
-----------------------------------------
  simple.Scrubby (Class) [Scrubby.java:5]
    Superclass: Frame [Scrubby.java:3]
    Imported classes/packages:
      Frame

    simple.Scrubby.~constructor~ (Method) [Scrubby.java:8]

    simple.Scrubby.y (Variable) [Scrubby.java:6]
      Type: int
      Referenced:
        File: e:\test\simple\Scrubby.java  Line: 13

    simple.Scrubby.f (Method) [Scrubby.java:12]
      Referenced:
        File: e:\test\simple\Scrubby.java  Line: 9
      Return type: int
      Parameters:
        simple.Scrubby.f.x (Variable) [Scrubby.java:12]
          Type: int
          Referenced:
            File: e:\test\simple\Scrubby.java  Line: 13

A Simple Symbol Table for the Java Language

Java programs are composed of definitions and references. You define classes, methods and variables and reference them in statements and other definitions. Each class, method and variable has a name. These names are called symbols when used in a parser.   A symbol represents one specific entity defined in your program. Multiple symbols may appear to be the same, but are actually separate definitions based on their location in the source files. Our cross reference tool needs to keep track of every symbol that is defined and where those symbols are referenced.

A parser tracks symbol information in a data structure called a symbol table.   The symbol table contains a list of all Java packages that it has encountered, a table of all unique strings (for storage efficiency) and a Stack of scoped-definitions that represent the current parse state. Our cross-reference tool uses a class called SymbolTable to represent the symbol table.

A scope is a section of code over which a definition is visible/usable. Every symbol is defined within a certain scope. Scopes can be nested. For example, a class defines a scope that contains all the variables and methods defined in that class.   A method defines a scope that contains all its parameters and local variable definitions. A stack is commonly used to represent this scope nesting. The innermost scope (containing the text being parsed) is always at the top of the stack.   As the parse proceeds into new scopes, the new scopes are pushed onto the stack.   As the parse exits scopes, the scope definition is popped from the stack. One of the key elements of name lookup is examining each element of the stack (starting at the top) to see if that scope contains the name we are looking for.
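The stack-of-scopes idea can be sketched in Java as follows. This is a simplified model using modern collections, and it maps names to plain strings; the real SymbolTable stores Definition objects:

```java
import java.util.*;

// A minimal sketch of a scope stack: each scope is just a name->definition
// map, and lookup walks from the innermost scope outward.
class ScopeStack {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    void pushScope() { scopes.push(new HashMap<>()); }   // entering a scope
    void popScope()  { scopes.pop(); }                   // exiting a scope

    void define(String name, String def) { scopes.peek().put(name, def); }

    // Examine each scope starting at the top of the stack.
    String lookup(String name) {
        for (Map<String, String> scope : scopes)   // innermost first
            if (scope.containsKey(name))
                return scope.get(name);
        return null;                               // not defined anywhere
    }
}
```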

All symbol definitions are represented in our symbol table as Definition objects. Definition is an abstract class that represents some basic, common information about all symbols in our symbol table. Java language constructs such as classes, interfaces and methods all have subclasses that ultimately extend Definition.   Most of the classes in package com.magelang.symtab are extended from Definition and provide added information based on the Java construct they represent.

There are two unusual class definitions in this symbol table which are worth mentioning.   PrimitiveDef represents a primitive type in the source program. Note that PrimitiveDef extends ClassDef.   This is not really "correct" from a modeling standpoint. However, subclassing ClassDef has a great advantage: widening primitive type conversions can be treated exactly the same as widening class conversions. This simplifies the code for method parameter type checking because we avoid the special case of "if the parameter is a primitive" and treat all parameters as objects.   (Note that because we only use the number of parameters to resolve methods, this advantage is not used. It is presented to show a possible extension to the tool.)

The other interesting case is the MultiDef class. MultiDef collects all definitions in a scope that have the same name, such as overloaded methods or a method that has the same name as a class variable. MultiDef defines an interesting method lookup algorithm that selects the appropriate method from a set of applicable methods for a given method call.

During the parse the symbol table is loaded with information about symbol definitions taken from the context of those definitions. This information is used later to find which symbol a name in the source refers to and store that reference information for later inclusion in the cross-reference report.

We provide a class diagram and descriptions of all classes that may help you understand how the symbol table classes are related.   The symbol table described may seem complex, but it is actually very simple compared to many, especially one used for C++.

How to Collect the Necessary Information

There are several possible designs for the overall processing necessary to build the symbol table and resolve references. The Java language allows classes to be defined after they have been used in a source file. To properly resolve a reference to a class, we can

  • "Walk, then resolve". During the parse, we watch for definitions and references. Definitions are recorded in the symbol table and references are recorded as strings. After the parse has been completed for all files,  all definitions have been seen. We then walk through all symbols in the symbol table and resolve the references.
  • "Backpatch" unknown definitions. This technique uses placeholder objects in the symbol table to say "I have not yet been defined" for any types that have not yet been seen. When the unknown symbol is finally defined, the parser will replace the placeholder references with the reference to the real definition.

    The difference between this and the first approach is that we never do a full walk through the symbol table to resolve references; we resolve as we see definitions.

Backpatching is the more efficient approach, but adds a great deal of complexity to the process. For simplicity, the cross-reference tool described for this article implements the first strategy.

The tool actually performs its processing in three phases: parse the source code to determine definitions and references, resolve references in the contents of the symbol table and generate the cross-reference report. In more detail:

  1. Determine Definitions and Note References
    The first phase of the cross-reference tool walks through all the source code in the specified directories (and their subdirectories.)  The parser collects information on the following constructs:
    • Package Definitions
    • Class Definitions
    • Interface Definitions
    • Method Definitions
    • References to other symbols

    For each of the above definitions found, a new symbol is created. This symbol may, in turn, reference other symbols such as a superclass that is being extended as part of a class definition. During the first pass, these references might not yet be available; they could be defined later in the same source file or in another file that has yet to be parsed. Because of this, any references to superclasses, implemented interfaces, return types for methods and parameter types are stored as placeholders by their names only. These placeholders are held in an instance of the DummyClass class.  These will also be resolved during phase two.

  2. Resolve Definition References
    After the source-code parse, we have a symbol table that lists all constructs defined in the source files. Each of these definitions may reference other definitions (for superclasses, variable types and so on). This second phase will walk through all definitions in the symbol table and resolve those references. Most of the symbol table classes implement a method called resolveTypes that is used to perform this resolution.

    At the conclusion of this pass, our symbol table contains all symbols defined in the parsed source files and proper resolutions to defined symbols. Next, we move on to finding references to our defined symbols.

  3. Report Generation
    This final phase looks at the data contained in the symbol table and generates a cross-reference report.
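The walk-then-resolve strategy behind phases one and two can be sketched as a toy in Java. ToyDefinition and ToySymbolTable are hypothetical stand-ins for the real com.magelang.symtab classes; only the superclass link is modeled:

```java
import java.util.*;

// Phase 1 records definitions with superclass *names* only (the
// placeholder idea); phase 2 walks all definitions and links them.
class ToyDefinition {
    final String name;
    final String superclassName;   // recorded as a string during the parse
    ToyDefinition superclass;      // filled in by resolveTypes (phase 2)
    ToyDefinition(String name, String superclassName) {
        this.name = name;
        this.superclassName = superclassName;
    }
}

class ToySymbolTable {
    final Map<String, ToyDefinition> defs = new LinkedHashMap<>();

    void define(ToyDefinition d) { defs.put(d.name, d); }   // phase 1

    void resolveTypes() {                                   // phase 2
        for (ToyDefinition d : defs.values())
            if (d.superclassName != null)
                d.superclass = defs.get(d.superclassName);  // may stay null
    }
}
```

Note that Scrubby can be defined before its superclass has been seen; the link is only made once all files have been parsed.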

Writing the Code

Let's discuss the thought process that goes into writing a tool like this. We do not have enough time or space to write a tutorial, but we will mention some key concepts and methods for the tool's implementation.

The tool was developed in the following steps:

  1. Select the tools.
  2. Write a recognizer.
  3. Design and Implement a symbol table.
  4. Add actions for definitions to the grammar.
  5. Add actions for references to the grammar.
  6. Resolve References
  7. Create report generation routines.

Select The Tools

The first choice we made was which tools to use to create the cross-reference tool. "Tools" in this instance means the language and parser generator. We selected Java as our language, which narrowed the scope of available parser generation tools. Fortunately ANTLR 2.0 exists and is an excellent choice as the parser generator!

Write a Recognizer

The tool began as a simple Java 1.1 recognizer. We started with an LALR(1) grammar for the Java language (as defined in The Java Language Specification, JLS for short) and converted it to a valid LL(k) grammar. 

The recognizer grammar defines our basic scanner and parser. This allows us to recognize any valid Java compilation unit.   This recognizer is similar to the browser described in the last article, without any embedded action code. Because recognition was discussed in an earlier article, we won't discuss it further here.

To compile and try this recognizer, issue the following commands. Note that you will need to have installed ANTLR 2.0 per its installation instructions.

java antlr.Tool javar.g
javac *.java
java JavaRecognizer dir1 dir2 dir3 ... dirn

where dir1...dirn are some directories containing java source files to recognize.

A few simple scripts are available to make building and running the tool simpler.   Look in the recognizer source directory for these and a README that explains their use.

Note that this grammar file actually constitutes a separate tool from the cross-reference tool, one that checks syntax of Java source files.

Design and Implement a Symbol Table

A recognizer is nice, but it has no memory of what was parsed. That's where a symbol table becomes meaningful. The tool needed to track and resolve references, so the next thing we did was add a symbol table to track definitions.

Symbol table design is not trivial. There are several common patterns used, but symbol tables for each language tend to differ greatly. We describe some of the requirements for our Java symbol table and some basic algorithms to implement those requirements.

How to Store Data

We decided to store symbol information starting with a Java package. A Java package can contain classes and interfaces. Classes and interfaces can contain methods, variables and inner classes and interfaces. These are but a few of the relationships between symbol objects. See Symbol Table Classes for a more complete description.

The symbol table stores a Vector of all package definitions that we have parsed. We can walk this Vector to access any symbol that was defined in the parsed source files. The symbol table also contains a stack of scopes that represent the current parse state.

Symbol Lookup

One of the basic constructs in nearly every symbol table is a stack of scopes.   Each scope that you enter while parsing a source file gets pushed onto this stack, giving you a simple lookup mechanism for "normal" variables. Suppose you were writing a symbol table for a language like Pascal that has nested procedures and no inheritance. The only lookup concern is "when I see x, is that x in the current scope or in some scope containing the current scope."  The lookup algorithm is very simple:

lookup1(name):
  for each scope on the scope stack (starting at the top)
    if name is defined in that scope
      we've found it!
    otherwise, try the next enclosing scope

This is the basis of symbol lookup in nearly every computer language. Note that the cross-reference tool doesn't actually define a lookup1 method; this action is built into the SymbolTable's lookup method. Method lookup1 is used here as an example to make the processing easier to see.

Looking up Qualified Names

To complicate matters a bit, many languages allow explicit qualification of names. For example, a name like x.y might mean "lookup y in the scope of x". The lookup algorithm was changed slightly by allowing a specific scope and only that specific scope to be searched. In our parser, when we see the x we want to make sure that the next symbol is looked up in x's scope. So we invented a routine named setScope that sets a variable in the symbol table to the specific scope to search.   When we see x.y, the following processing occurs:

 setScope(lookup("x"));
 lookup("y");

and lookup is defined as

lookup(name):
  if a specific scope is set
    look in that scope for name
    clear the specific scope
    return the result
  else
    lookup1(name)

Adding Inheritance

Next we added inheritance to the mix. This really isn't as bad as it may seem, though. The trick was to just modify the lookup1 method slightly:

lookup1(name):
  for each scope on the scope stack (starting at the top)
    s = that scope
    while not found and s is valid
      if name is defined in s
        we've found it!
      else if s has a superclass
        s = superclass of s
      else
        s = NOT VALID
    otherwise, try the next containing scope

We nested an inheritance search inside the normal scoped search. A similar modification is made to the lookup method, and similar processing is used to look up a name in an implemented interface. (Note that the interface lookup proceeds after the inheritance lookup.)
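The nested inheritance search might be sketched in Java like this (scopes as maps with an optional superclass link; the field names are illustrative assumptions, not the real ClassDef API):

```java
import java.util.*;

// Each scope may have a superclass scope, searched before giving up on it.
class InheritingScope {
    final Map<String, String> names = new HashMap<>();
    InheritingScope superclass;          // null if none
}

class InheritingLookup {
    // The outer loop is the normal scope-stack walk (innermost first);
    // the inner loop climbs the superclass chain of each scope.
    static String lookup(List<InheritingScope> stack, String name) {
        for (InheritingScope scope : stack)
            for (InheritingScope s = scope; s != null; s = s.superclass)
                if (s.names.containsKey(name))
                    return s.names.get(name);
        return null;
    }
}
```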

Overloading Symbols

What do we do about overloading?  In a language like Java, several methods can have the same name and/or the same name as a variable in the class. Each method is unique based on the types of the parameters to that method. A proper method lookup modifies the above algorithms to achieve the following effect:

get the set of all applicable methods
bestSoFar = first applicable method
for each other applicable method "challenger"
  if challenger is more specific than bestSoFar
    bestSoFar = challenger

An applicable method is one where the actual parameter types are either exact matches for the method's formal parameter types or can be widened to the formal parameter types. (Widening refers to a "widening conversion"; see the Java Language Specification for details. In short, think "actuals must be subclasses of, or the same type as, the formals." Primitives are where you need to consult the JLS.)

A method is more specific if each of its formal parameter types can be widened to the corresponding parameter type in the other method. A simple (but sneaky) way to test this is to see if one method is applicable to the other method's formal parameter types. If so, the other method is more specific. (Read that about fifty times and it might make some sense. The JLS explains it in a similar manner.)
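Using reflection Class objects as stand-ins for symbol-table types (purely an assumption for illustration; the tool itself never does this), the applicability and more-specific tests can be sketched as:

```java
// Sketch of applicability and the "more specific" trick described above.
class Specificity {
    // A method with these formals is applicable to the given argument
    // types if each argument widens to the corresponding formal.
    static boolean applicable(Class<?>[] formals, Class<?>[] args) {
        if (formals.length != args.length) return false;
        for (int i = 0; i < formals.length; i++)
            if (!formals[i].isAssignableFrom(args[i])) return false;
        return true;
    }

    // m1 is more specific than m2 if m2 is applicable to m1's formals.
    static boolean moreSpecific(Class<?>[] m1Formals, Class<?>[] m2Formals) {
        return applicable(m2Formals, m1Formals);
    }
}
```

For example, a method taking String is more specific than one taking Object, because the Object method is applicable when handed a String.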

Because we don't have access to all superclasses of all types (because we are not parsing compiled files), we had to select an alternate method of resolving overloaded methods. Instead of using the above algorithm, we simply match the number of parameters. This is effective for most methods, but fails in cases where overloaded methods have the same number of parameters.

Storage in the symbol table consists primarily of hash tables of elements. Each symbol that defines a scope contains a hash table of all elements in its scope. This works well if all elements have unique names. However, if some elements have the same name (overloaded methods) we need an alternate storage mechanism.

That's where the MultiDef class comes into play. MultiDef is a list of other Definitions. If the name lookup returns a MultiDef, we ask the MultiDef to resolve further using the number of parameters (or -1 if we are looking for a variable reference.)
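A minimal sketch of MultiDef-style resolution by parameter count might look like this (the class and field names here are hypothetical simplifications of the real MultiDef):

```java
import java.util.*;

// MultiDef collects same-named definitions; resolution picks the one
// whose parameter count matches (-1 means "a variable, not a call").
class MultiDefSketch {
    static class Def {
        final String name;
        final int paramCount;   // -1 for variables
        Def(String name, int paramCount) {
            this.name = name;
            this.paramCount = paramCount;
        }
    }

    final List<Def> defs = new ArrayList<>();

    Def resolve(int paramCount) {
        for (Def d : defs)
            if (d.paramCount == paramCount)
                return d;
        return null;   // no definition with that arity
    }
}
```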

Now that you've had a high-level overview of the symbol table processing, take a look at the symbol table source code.

Add Actions for Definitions to the Grammar

Now that we've covered the background information, we can discuss interaction between the recognizer and the symbol table. We created an instance of the symbol table in the parser and added calls to store all definition and reference information in that symbol table. If you look at the source for the final cross reference grammar, you'll see the symbol table being created in the main method and passed to each parser that is created. A separate parser instance is created for each file to be parsed, but they all share the same symbol table. The parser also defines several convenience methods that just pass data to the symbol table. We recommend this approach because you can easily add debug code in these wrapper methods to track the data that is being passed to the symbol table.

The first thing necessary to track definition information is to expand the common token type that is used by the recognizer. The recognizer uses antlr.CommonToken as the object that is passed from the scanner to the parser representing a token. We needed to add a little information that can be passed from the parser to the symbol table.   We extend CommonToken creating a new class named JavaToken. JavaToken adds a File reference and parameter count, information that can be used when resolving referenced symbols.
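The shape of JavaToken can be sketched as follows. Since antlr.CommonToken may not be on the classpath here, a minimal stand-in base class is used, and the field names are assumptions based on the description above, not the tool's actual declarations:

```java
// Minimal stand-in for antlr.CommonToken so this sketch is self-contained;
// the real JavaToken extends the actual ANTLR class.
class CommonTokenSketch {
    String text;
    int line;
}

// The extra information the parser hands to the symbol table.
class JavaTokenSketch extends CommonTokenSketch {
    java.io.File file;     // which source file the token came from
    int paramCount = -1;   // for method references; -1 means "not a call"
}
```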

Next we determined where Java constructs are defined in the grammar. A few examples are

  • classDefinition
  • packageDefinition
  • variableDeclarator
  • compoundStatement
  • methodHead

(the above is not an exhaustive list.)

At each place where a java definition construct is recognized, we added code to inform the symbol table. Using methodHead as an example:

// This is the header of a method.
// It includes the name and parameters for the method.
// This also watches for a list of exception
// classes in a "throws" clause.
methodHead
    :   IDENT   // the name of the method

        // parse the formal parameter declarations
        LPAREN (parameterDeclarationList)? RPAREN

        // again, the array specification is skipped...
        (LBRACK RBRACK)*

        // get the list of exceptions that this
        // method is declared to throw
        (throwsClause)?
    ;

After we see the method name, we need to tell the symbol table enough information to define a MethodDef object. We also need to tell it to associate the parameters with the method.

To create the method definition, we need to pass the following information to the symbol table:

  • method name
  • return type

Uh oh. We don't have the return type!  When we created the LL(k) grammar, we factored out the type because field could be either a method or variable definition (as well as a few other things):

field
    :   ...
        typeSpec   // method or variable declaration(s)
        (   methodHead ( compoundStatement | SEMI )
        |   variableDefinitions SEMI
        )

We must keep typeSpec in field, so how can we have it available in methodHead?   Two possibilities:

  1. Pass the type information into methodHead.
  2. Return the other method information from methodHead and define the method in field.

We preferred the first alternative, as it helps keep the method-definition call with the method match. So we modified typeSpec to return a JavaToken:

typeSpec returns [JavaToken t]
    {t=null;}
    :   t=type (LBRACK RBRACK)*
    ;

The token is passed up through type to typeSpec. Then we modified field to use that type:

field
    {JavaToken type;}
    :   // method, constructor or variable declaration

        ...

    |   type=typeSpec   // method or variable declaration(s)
        (   methodHead[type]
            ( compoundStatement[BODY] | SEMI {popScope();} )
        |   variableDefinitions[type] SEMI
        )

This grabs the type and passes it down into methodHead. Meanwhile, methodHead needed to be changed to accept the type and pass the data to the symbol table:

methodHead[JavaToken type]
    {JavaVector exceptions=null;}   // to keep track of thrown exceptions
    :   method:IDENT   // the name of the method

        // tell the symbol table about it. Note that this signals that
        // we are in a method header, so we handle parameters appropriately
        {defineMethod((JavaToken)method, type);}

        // parse the formal parameter declarations. These are sent to the
        // symbol table as _variables_. Because the symbol table knows we
        // are in a method header, it collects these definitions as
        // parameters for the method.
        LPAREN (parameterDeclarationList)? RPAREN

        // again, the array specification is skipped...
        (LBRACK RBRACK)*

        // get the list of exceptions that this method is declared to throw
        (exceptions=throwsClause)?

        // tell the symbol table we are done with the method header. Note
        // that this will tell the symbol table to handle variables normally
        {endMethodHead(exceptions);}
    ;

Method defineMethod tells the symbol table that we are defining a new method. It also sets a symbol table flag that says "any variable definitions should be added as parameters to the current method header". Method endMethodHead halts that behavior. Similar actions need to be added anywhere the grammar recognizes that a construct is defined in a Java program.

As a check, we defined a few dump methods in the symbol table. We ran the parser and dumped the symbol table contents to ensure that all definitions were added properly. This is an important step, as you should make sure your definitions are properly added to the symbol table before trying to add and resolve references.

Add Actions for References to the Grammar

Next we added actions that tell the symbol table whenever we reference a symbol.   Be careful here, as it's very easy to have references double up if you have a reference call in a low-level rule and in one that eventually calls that low-level rule. The places you need to flag references are

  • labels for break and continue statements.
  • the type in a cast expression.
  • the type in an instanceof expression.
  • the id in a postfixExpression.

Other references are set up automatically by the symbol table. For example, when defining a class that has a superclass, a reference to the superclass is automatically set up.

Reference actions are very simple and look like

    // get out of a loop (or switch)
    // tell the symbol table that we are
    // (possibly) referencing a label
    |   "break" (bid:IDENT {reference((JavaToken)bid);})? SEMI

(which is defined in statement.)  The reference method tells the symbol table about the token (which contains location information) and the symbol table simply stores the reference in the current scope's unresolved reference list.
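What the reference action amounts to can be sketched as follows (the names here are illustrative, not the tool's actual API):

```java
import java.util.*;

// Each scope keeps a list of references that have not been resolved yet.
class ScopeSketch {
    final List<String> unresolvedRefs = new ArrayList<>();
}

// The parser's reference() helper just records the token's text in the
// current (top-of-stack) scope for resolution in phase two.
class RefRecorder {
    final Deque<ScopeSketch> stack = new ArrayDeque<>();

    void reference(String tokenText) {
        stack.peek().unresolvedRefs.add(tokenText);
    }
}
```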

Resolve References

The symbol table then needed to check all of its definitions and see which have unresolved references. These references could be a superclass for a class or a variable used in an expression.

Each symbol may have nothing to resolve or several things to resolve. It can be very difficult to resolve references properly without entering an infinite loop, especially with circular definitions. In some cases we added a variable that tracks whether an object (such as a Vector) is currently being resolved, preventing infinite circular resolution attempts.

The symbol table classes define methods named resolveTypes and in some cases resolveRefs.  Examine these methods to see what was done to perform this resolution. In particular, examine the lookup methods defined in SymbolTable and how they deal with qualified names.

An important note: the language specification is extremely important, especially with regard to name lookup. Whenever you are writing a parser, do not assume you know the language you are parsing. Thoroughly read the language specification to ensure you see all the details before starting to write your grammar. Follow this link to The Java Language Specification.

Create Report Generation Routines

Writing a fairly readable report is the final task. We tend to like nicely indented reports, where indentation shows scope containment of other definitions. To facilitate this, we defined a class named IndentingPrintWriter that subclasses PrintWriter.  We overrode all the print and println methods to write a certain padding string before anything on a line and defined an indent and dedent method to change the amount of padding.
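A minimal sketch of the IndentingPrintWriter idea follows; only println(String) is overridden here for brevity, whereas the real class overrides all the print and println variants:

```java
import java.io.*;

// Pads each line written via println(String) with the current indentation.
class IndentingPrintWriter extends PrintWriter {
    private String pad = "";

    IndentingPrintWriter(Writer out) { super(out); }

    void indent() { pad += "  "; }

    void dedent() {
        if (pad.length() >= 2)
            pad = pad.substring(2);
    }

    @Override
    public void println(String s) {
        super.println(pad + s);   // prefix the padding, then the line
    }
}
```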

We finally defined a report(IndentingPrintWriter) method in each symbol table class to print information about the class, indent the print writer, report information about contained elements and dedent the print writer.

The Source Code

All of the source code for the cross reference tool is provided. To execute the cross reference tool, issue the following commands:


java antlr.Tool java.g
javac *.java
java JavaXRef dir1 dir2 dir3 ... dirn

where dir1...dirn are the directories or files to parse.

A few simple scripts are available to make building and running the tool simpler.   Look in the crossref source directory for these and a README that explains their use.

To compile and try this recognizer, you will need to have installed ANTLR 2.0 per its installation instructions.

All source code presented in this article is free for your use with no restrictions whatsoever. MageLang Institute and JavaSoft retain no copyright to the source code, nor do they provide any guarantee of the code's suitability for any purpose. As always, use of this code is at your own risk.

In addition, Symbol Table Classes describes each of the classes used in the symbol table and has links to each symbol table file independently.

ANTLR 2.00, or PCCTS: Part Deux

The tag generation tool described in the previous article of this parsing series was written using PCCTS (The Purdue Compiler Construction Tool Set), a tool with a long history (going back to the late 1980’s). PCCTS could only generate C or C++, but we used it anyway because the new (Java) version was only in alpha release. We are happy to report that the complete rewrite of PCCTS in Java to generate Java, called ANTLR (ANother Tool for Language Recognition), is available and was used to implement the cross-reference tool described above. PCCTS was renamed ANTLR because no one knew how to pronounce PCCTS (although someone suggested that the official pronunciation should be "Fred"). To help you read the grammar provided with this article, we need to describe the primary change when moving from PCCTS to ANTLR: a radical departure from conventional lexical analysis (scanning).

ANTLR generalizes the notion of scanning, parsing and tree parsing into the single abstraction of applying grammatical structure to an input stream (be it a stream of characters, tokens or tree nodes). A grammar file in ANTLR now consists of (grammar) class definitions using a single consistent notation rather than different notations for parsing and scanning. Language tools have traditionally used different notations for the seemingly different tasks. For example, programmers used regular expression notation to describe input symbols and BNF (Backus-Naur Form) notation to describe language structure. In other words, programmers had to use the regular expression

 [a-z]+

(one or more characters from ‘a’ to ‘z’) to describe an identifier and


declaration
: type IDENTIFIER ASSIGN expression SEMICOLON
;

to describe the grammatical structure of a declaration in their language of interest. ANTLR uses the same extended-BNF notation to describe both. So, in a lexical grammar, you can specify:

class MyPascalLexer extends Lexer;
…
IDENTIFIER
: ( 'a'..'z' )+
      ;

Along with this wonderful unification ("syntactic yummy"), ANTLR generates much stronger scanners than the finite-automata scanners generated by traditional tools such as lex and DLG (the lexer-generator in PCCTS). The new scanner mechanism is better because you can

  • have semantic and syntactic predicates and use k>1 lookahead.
  • read and debug the output as it's very similar to what you would build by hand (that is, it's not a table full of integers like automata-based scanners).
  • have actions executed during the recognition of a single token.
  • recognize complicated tokens such as HTML tags or "executable" comments like the javadoc @-tags inside /** ... */ comments. The lexer has a stack, unlike a finite automaton, so you can match nested structures such as nested comments.

The increase in scanning power comes at the cost of some inconvenience in scanner specification and indeed requires a shift in your thoughts about lexical analysis. Finite automata do automatic left-factoring of common (possibly infinite) left-prefixes. In a hand-built scanner, you have to find and factor out all common prefixes. For example, consider writing a lexer to match integers and floats. The regular expressions are straightforward:

  INTEGER : "[0-9]+" ;
  REAL    : "[0-9]+{.[0-9]*}|.[0-9]+" ;

Building a scanner for this by hand or using ANTLR's scanner mechanism would require factoring out the common [0-9]+. We found in practice, however, that the advantages seriously outweigh this one inconvenience.
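As a sketch only (the rule name is hypothetical, and this has not been checked against every detail of ANTLR 2.0's lexer syntax), the INTEGER and REAL rules above could be folded by hand into a single rule with the common digit prefix factored out:

```
NUM
    :   ('0'..'9')+ ( '.' ('0'..'9')* )?   // INTEGER, or a REAL like "3.14"
    |   '.' ('0'..'9')+                    // a REAL like ".5"
    ;
```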

For a full description of ANTLR's lexical mechanism, please see the documentation and resources at the ANTLR Home Page.

Conclusion

We conclude this series with this cross reference tool. It's a real tool that demonstrates the power of parsing technology. Cross-referencing is a common task and provides a glimpse into several issues related to parser development. A real tool is not just a simple grammar that recognizes input of a given source language;   it needs other constructs that work in concert with the parser. We used the "latest and greatest" version of ANTLR to implement the tool, parsing Java source files using Java itself. We demonstrated how a symbol table can be connected with a parser to give the parser some memory and resolve references between definitions in that symbol table.

Where to Go for More Information

ANTLR - ANother Tool for Language Recognition
The Java Home Page
The Java Language Specification
ANTLR Usenet Newsgroup
General Compilers/Parsers Newsgroup

copyright © Sun Microsystems, Inc