## Introduction

Have you ever wanted to build a tool that examines or processes Java files? For example, you might be interested in a tool that enforces a coding style, or one that automatically augments programs with debugging code snippets (assertions). Building a real parser for Java with a complete symbol table is nontrivial, and most programmers have shied away from such a task, relying instead on partial solutions written in Perl. For this final article in the parsing series, we built a Java cross-referencing tool that accumulates complete symbol table information using a Java parser generated by the ANTLR parser generator.

The previous article in this series discussed the generation of "tag files" needed to build a code browser and made use of an older version of ANTLR that generated parsers implemented in C++. The tag file generator was almost exclusively a Java parser (no symbol table manager) and, hence, that discussion concentrated on the Java grammar. In this article, we focus on the design of a Java symbol table manager and its interaction with a Java parser. We also make a small detour to describe the latest version of ANTLR, which generates parsers implemented in Java.

## What is a cross-reference tool?

A Java program consists of definitions and references. When you write a program, you define constructs like classes, methods and variables. These definitions include references to other Java constructs: classes may refer to another class as their superclass, variables refer to a class or primitive as their type, and statements may refer to methods to delegate processing. A cross-reference tool describes these relationships by parsing one or more Java source files and determining which constructs are actually being referred to. The relationships are presented in a report that shows which constructs refer to which other constructs and where a given construct is used. This article discusses some of the key concepts involved in building a cross-reference tool.
We start by discussing the functionality that is implemented, then work through some of the design issues encountered while building the tool.

## Functionality

Before writing code, we should decide what the tool will do.
Because this tool only parses Java source, some definitions will be unavailable. For example, the source might contain statements such as

```
java.util.Vector v = new java.util.Vector();
hashTable.put(java.awt.Color.blue);
```

where java.util.Vector and java.awt.Color are defined only in compiled class files that we never parse, so we cannot know the exact parameter types of the methods they declare. Therefore, we settle on a slightly weaker method-resolution algorithm. Assuming the Java source is valid, chances are pretty good that we can use the number of parameters to find the correct method. Obviously this won't always work, but most methods are not overloaded, and a good number of those that are differ in their number of parameters. Of course there are many more things this tool could do; the interested reader can look at Possible Extensions for some ideas.

## Sample Input and Output

A simple Java class (that does nothing) demonstrates the cross-reference tool's output. Run on such an input, the cross-reference tool generates a report listing each definition and the places where it is referenced.
## A Simple Symbol Table for the Java Language

Java programs are composed of definitions and references. You define classes, methods and variables, and reference them in statements and other definitions. Each class, method and variable has a name. These names are called symbols when used in a parser. A symbol represents one specific entity defined in your program; multiple symbols may appear to be the same but are actually separate definitions, distinguished by their locations in the source files. Our cross-reference tool needs to keep track of every symbol that is defined and of where those symbols are referenced.

A parser tracks symbol information in a data structure called a symbol table. The symbol table contains a list of all Java packages it has encountered, a table of all unique strings (for storage efficiency) and a stack of scoped definitions that represents the current parse state. Our cross-reference tool wraps all of this in a single symbol table class.
A scope is a section of code over which a definition is visible/usable. Every symbol is defined within a certain scope, and scopes can be nested. For example, a class defines a scope that contains all the variables and methods defined in that class; a method defines a scope that contains all its parameters and local variable definitions. A stack is commonly used to represent this scope nesting. The innermost scope (containing the text being parsed) is always at the top of the stack. As the parse proceeds into new scopes, the new scopes are pushed onto the stack; as the parse exits scopes, the scope definition is popped from the stack. One of the key elements of name lookup is examining each element of the stack (starting at the top) to see if that scope contains the name we are looking for. All symbol definitions are represented in our symbol table as definition objects that record a symbol's name, its location and the scope that contains it.
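The scope stack described above can be sketched as follows. This is a minimal illustration, not the tool's actual classes: the real symbol table stores full definition objects rather than strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a scope stack: each scope maps a name to its definition
// (represented here by a plain String label).
class ScopeStack {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    void enterScope() { scopes.push(new HashMap<>()); }
    void exitScope()  { scopes.pop(); }

    void define(String name, String def) { scopes.peek().put(name, def); }

    // Search from the innermost scope outward; the first match wins.
    String lookup(String name) {
        for (Map<String, String> scope : scopes) {
            String def = scope.get(name);
            if (def != null) return def;
        }
        return null; // undefined
    }
}
```

Pushing a scope on entry and popping it on exit keeps the innermost scope at the top, so the lookup loop naturally visits scopes from innermost to outermost.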
A couple of the class definitions in this symbol table are unusual and worth examining in the source. During the parse, the symbol table is loaded with information about symbol definitions, taken from the context of those definitions. This information is used later to find which symbol a name in the source refers to, and that reference information is stored for later inclusion in the cross-reference report. We provide a class diagram and descriptions of all classes to help you understand how the symbol table classes are related. The symbol table described here may seem complex, but it is actually very simple compared to many, especially one used for C++.

## How to Collect the Necessary Information

There are several possible designs for the overall processing necessary to build the symbol table and resolve references. The Java language allows classes to be defined after they have been used in a source file. To properly resolve a reference to a class, we can either wait until all of the source has been parsed before resolving any references, or resolve references as we encounter them and backpatch those whose definitions have not yet been seen.
Backpatching is the more efficient approach, but it adds a great deal of complexity to the process. For simplicity, the cross-reference tool described in this article implements the first strategy. The tool performs its processing in three phases: parse the source code to determine definitions and references, resolve the references recorded in the symbol table, and generate the cross-reference report.
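The three phases can be sketched as follows. The class and field names here are hypothetical stand-ins; the real tool builds a much richer symbol table during phase one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the three-phase design: collect definitions and raw references
// first, resolve them all afterward, then report.
class XRef {
    static class Def {
        final String name;                           // e.g. a class name
        final List<String> refs = new ArrayList<>(); // names it refers to
        Def(String name) { this.name = name; }
    }

    // Phase 1 output: definitions keyed by name (normally built by the parser).
    static Map<String, Def> defs = new LinkedHashMap<>();

    // Phase 2: resolve each textual reference to a definition we actually saw.
    static Map<String, List<String>> resolve() {
        Map<String, List<String>> usedBy = new LinkedHashMap<>();
        for (Def d : defs.values())
            for (String r : d.refs)
                if (defs.containsKey(r))
                    usedBy.computeIfAbsent(r, k -> new ArrayList<>()).add(d.name);
        return usedBy;
    }

    // Phase 3: format the cross-reference report.
    static String report() {
        StringBuilder sb = new StringBuilder();
        resolve().forEach((name, users) ->
            sb.append(name).append(" referenced by ").append(users).append('\n'));
        return sb.toString();
    }
}
```

Because resolution runs only after every file has been parsed, forward references within and across files fall out naturally, with no backpatching machinery.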
## Writing the Code

Let's discuss the thought process that goes into writing a tool like this. We do not have enough time or space to write a tutorial, but we will mention some key concepts and methods used in the tool's implementation. The tool was developed in the steps described in the sections that follow.
## Select The Tools

The first choice we made was which tools to use to create the cross-reference tool. "Tools" in this instance means the language and the parser generator. We selected Java as our language, and this narrowed the field of available parser generators. Fortunately, ANTLR 2.0 exists and is an excellent choice as that parser generator!

## Write a Recognizer

The tool began as a simple Java 1.1 recognizer. We started with an LALR(1) grammar for the Java language (as defined in The Java Language Specification, JLS for short) and converted it to a valid LL(k) grammar. The recognizer grammar defines our basic scanner and parser, allowing us to recognize any valid Java compilation unit. This recognizer is similar to the browser described in the last article, without any embedded action code. Because recognition was discussed in an earlier article, we won't discuss it further here.

To compile and try this recognizer, issue the following commands. Note that you will need to have installed ANTLR 2.0 per its installation instructions.

```
java antlr.Tool javar.g
javac *.java
java JavaRecognizer dir1 dir2 dir3 ... dirn
```

where dir1...dirn are directories containing Java source files to recognize. A few simple scripts are available to make building and running the tool simpler; look in the recognizer source directory for these and a README that explains their use. Note that this grammar file actually constitutes a separate tool from the cross-reference tool: one that simply checks the syntax of Java source files.

## Design and Implement a Symbol Table

A recognizer is nice, but it has no memory of what was parsed. That's where a symbol table becomes meaningful. The tool needed to track and resolve references, so the next thing we did was add a symbol table to track definitions. Symbol table design is not trivial: there are several common patterns, but symbol tables for different languages tend to differ greatly.
We describe some of the requirements for our Java symbol table and some basic algorithms that implement those requirements.

## How to Store Data

We decided to store symbol information starting with a Java package. A Java package can contain classes and interfaces; classes and interfaces can contain methods, variables, and inner classes and interfaces. These are but a few of the relationships between symbol objects; see Symbol Table Classes for a more complete description. The symbol table stores a hierarchy of definitions rooted at the packages it has seen.

## Symbol Lookup

One of the basic constructs in nearly every symbol table is a stack of scopes. Each scope that you enter while parsing a source file gets pushed onto this stack, giving you a simple lookup mechanism for "normal" variables. Suppose you were writing a symbol table for a language like Pascal, which has nested procedures and no inheritance. The only lookup concern is: when I see x, is that the x in the current scope or in some scope containing the current scope? The lookup algorithm is very simple: starting at the top of the stack, check each scope for the name and take the first definition found; if no scope contains the name, it is undefined.
This is the basis of symbol lookup in nearly every computer language. Note that the cross-reference tool doesn't actually define a separate stack class for this; the scope nesting is managed inside the symbol table itself.

## Looking up Qualified Names

To complicate matters a bit, many languages allow explicit qualification of names. A name like x.y cannot be resolved with a single stack search. Instead, we first look up x with the normal algorithm, then look up y only within the scope of whatever definition x turned out to be; conceptually, setScope(lookup("x")); followed by lookup("y");.
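Qualified-name lookup can be sketched like this. The model is deliberately simplified and the names are hypothetical: the top-level map stands in for the normal stack search, and each resolved definition may own a scope of members.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of qualified-name lookup: resolve "a.b.c" by looking up "a" with
// the normal algorithm, then each later part only inside the previous scope.
class QualifiedLookup {
    // Stand-in for the normal scope-stack search.
    static Map<String, String> topScope = new HashMap<>();
    // Each definition may own a scope of member names.
    static Map<String, Map<String, String>> memberScopes = new HashMap<>();

    static String lookup(String qualified) {
        String[] parts = qualified.split("\\.");
        String current = topScope.get(parts[0]);        // normal lookup
        for (int i = 1; i < parts.length && current != null; i++) {
            Map<String, String> scope = memberScopes.get(current);
            current = (scope == null) ? null : scope.get(parts[i]); // scoped lookup
        }
        return current;
    }
}
```

The key point is that only the first part of the name uses the scope stack; every subsequent part is restricted to the scope selected by the part before it.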
## Adding Inheritance

Next we added inheritance to the mix. This really isn't as bad as it may seem, though. The trick was to modify the class-scope lookup: when a name is not found in a class's own scope, search its superclass (and that class's superclass, and so on) before moving down the scope stack. We nested an inheritance search inside the normal scoped search. A similar modification is made to the lookup for interfaces, where all extended interfaces must be searched.

## Overloading Symbols

What do we do about overloading? In a language like Java, several methods can have the same name and/or the same name as a variable in the class. Each method is unique based on the types of the parameters to that method. A proper method lookup modifies the above algorithms to achieve the following effect:

```
get the set of all applicable methods
bestSoFar = first applicable method
for each other applicable method "challenger"
    if challenger is more specific than bestSoFar
        bestSoFar = challenger
```

An applicable method is one where the actual parameter types are either exact matches for the method's formal parameter types or can be widened to the formal parameters. (Widening refers to a "widening conversion"; see the Java Language Specification for details. In short, think of the actuals as having to be the same types as, or subclasses of, the formals; for primitives, consult the JLS.) A method is more specific if each of its formal parameter types can be widened to the corresponding parameter type of the other method. A simple (but sneaky) way to test this is to see whether one method is applicable for the other method's formal parameter types. If so, the other method is more specific. (Read that a few times and it might make some sense. The JLS explains it in a similar manner.)

Because we don't have access to all superclasses of all types (we are not parsing compiled files), we had to select an alternate means of resolving overloaded methods. Instead of using the above algorithm, we simplified it to a simple match on the number of parameters. This is effective for most methods, but fails where overloaded methods have the same number of parameters.

Storage in the symbol table consists primarily of hash tables of elements. Each symbol that defines a scope contains a hash table of all elements in its scope. This works well if all elements have unique names.
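The simplified parameter-count resolution, together with the name-to-list storage it requires, can be sketched as follows. The class names are hypothetical, not the tool's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the simplified overload resolution: a name maps to the list of
// methods sharing that name, and a call is resolved by parameter count.
class MethodTable {
    static class Method {
        final String name;
        final int paramCount;
        Method(String name, int paramCount) {
            this.name = name;
            this.paramCount = paramCount;
        }
    }

    private final Map<String, List<Method>> methods = new HashMap<>();

    void define(Method m) {
        methods.computeIfAbsent(m.name, k -> new ArrayList<>()).add(m);
    }

    // Return the first method whose parameter count matches the call site.
    // As noted above, this cannot disambiguate overloads that take the same
    // number of parameters.
    Method resolve(String name, int argCount) {
        for (Method m : methods.getOrDefault(name, Collections.emptyList()))
            if (m.paramCount == argCount) return m;
        return null;
    }
}
```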
However, if some elements have the same name (overloaded methods), we need an alternate storage mechanism: the symbol table lets a single name map to a group of definitions rather than to just one. Now that you've had a high-level overview of the symbol table processing, take a look at the symbol table source code.

## Add Actions for Definitions to the Grammar

Now that we've covered the background information, we can discuss the interaction between the recognizer and the symbol table. We created an instance of the symbol table in the parser and added calls to store all definition and reference information in that symbol table. If you look at the source for the final cross-reference grammar, you'll see where the symbol table is created in the parser.
common token
type that is used by the recognizer. The recognizer uses
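As an illustration, an extended token might carry fields like these. The field and class names are our assumptions for the sketch; the real tool extends the token type generated for ANTLR.

```java
// Sketch of an "extended" token carrying the extra context the symbol table
// needs to record where a definition or reference occurred.
class XRefToken {
    int type;     // token type assigned by the scanner
    String text;  // the matched text, e.g. an identifier's name
    String file;  // source file the token came from
    int line;     // line number for the cross-reference report

    XRefToken(int type, String text, String file, int line) {
        this.type = type;
        this.text = text;
        this.file = file;
        this.line = line;
    }

    // A definition or reference can then be reported as "file:line".
    String location() { return file + ":" + line; }
}
```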
Next we determined where Java constructs are defined in the grammar: class and interface definitions, method and constructor definitions, variable and parameter declarations, and so on (not an exhaustive list). At each place where a Java definition construct is recognized, we added code to inform the symbol table. Using method definitions as an example: after we see the method name, we need to tell the symbol table enough information to define a method. To create the method definition, we need to pass the symbol table the method's name, its return type, its parameters and its location in the source.
Uh oh. We don't have the return type! When we created the LL(k) grammar, we factored out the type that begins both method and variable declarations, because it is a common prefix of the two. By the time the rule that recognizes a method runs, the return type has already been matched by an enclosing rule. We must keep that type available in one of two ways: pass its token down into the rule that matches the method, or move the method-definition call up into the enclosing rule. We preferred the first alternative, as it helps keep the method definition call with the method match. So we modified the grammar to capture the token for the type, pass that token up through the rule that matched it, and then pass it down into the rule that recognizes the method. That rule grabs the type and hands it to the symbol table along with the method's name and parameters.
Method As a check, we defined a few dump methods in the symbol table. We ran the parser nd dumped the symbol table contents to ensure that all definitions were added properly. This is an important thing to do at this point, as you should make sure your definitions are properly added to the symbol table before trying to add and resolve references. Add Actions for References to the GrammarNext we added actions that tell the symbol table whenever we reference a symbol. Be careful here, as it's very easy to have references double up if you have a reference call in a low-level rule and in one that eventually calls that low-level rule. The places you need to flag references are
Other references are set up automatically by the symbol table. For example, when defining a class that has a superclass, a reference to the superclass is automatically set up. Reference actions themselves are very simple: each is a single call telling the symbol table which name was referenced and where.
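A reference action can be as small as this sketch; the class and method names are hypothetical, and resolution of each recorded reference happens in the later resolve phase.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reference-recording side of the symbol table: the parser
// calls reference() at each point in the grammar where a name is used.
class ReferenceRecorder {
    static class Ref {
        final String name;
        final int line;
        Ref(String name, int line) {
            this.name = name;
            this.line = line;
        }
    }

    final List<Ref> refs = new ArrayList<>();

    // Record the use now; matching it to a definition happens later.
    void reference(String name, int line) { refs.add(new Ref(name, line)); }
}
```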
## Resolve References

The symbol table then needed to check all of its definitions and see which have unresolved references. These references could be a superclass for a class or a variable used in an expression. Each symbol may have nothing to resolve or several things to resolve. It can be very difficult to resolve references properly without entering an infinite loop, especially with circular definitions. In some cases we added a flag that tracks whether an object (such as a class) is already in the middle of being resolved, so that a circular chain of definitions terminates instead of recursing forever. The symbol table classes define resolution methods for this purpose; see the source for details.

An important note: the language specification is extremely important, especially with regard to name lookup. Whenever you are writing a parser, do not assume you know the language you are parsing. Thoroughly read the language specification to ensure you see all the details before starting to write your grammar. Follow this link to The Java Language Specification.

## Create Report Generation Routines

Writing a fairly readable report is the final task. We tend to like nicely
indented reports, where indentation shows scope containment of other definitions. To facilitate this, we defined a helper class that manages the indentation level as the report generator walks the definitions. We finally defined a top-level routine that walks the symbol table and writes the report.

## The Source Code

All of the source code for the cross-reference tool is provided. To execute the cross-reference tool, issue the following commands:

```
java antlr.Tool java.g
javac *.java
java JavaXRef dir1 dir2 dir3 ... dirn
```

where dir1...dirn are the directories or files to parse. A few simple scripts are available to make building and running the tool simpler; look in the crossref source directory for these and a README that explains their use. To compile and try the tool, you will need to have installed ANTLR 2.0 per its installation instructions.

All source code presented in this article is free for your use with no restrictions whatsoever. MageLang Institute and JavaSoft retain no copyright to the source code, nor do they provide any guarantee of the code's suitability for any purpose. As always, use of this code is at your own risk. In addition, Symbol Table Classes describes each of the classes used in the symbol table and has links to each symbol table file independently.

## ANTLR 2.0, or PCCTS: Part Deux

The tag generation tool described in the previous article of this parsing series was written using PCCTS (the Purdue Compiler Construction Tool Set), a tool with a long history going back to the late 1980s. PCCTS could only generate C or C++, but we used it anyway because the new (Java) version was only in alpha release. We are happy to report that the complete rewrite of PCCTS in Java to generate Java, called ANTLR (ANother Tool for Language Recognition), is available and was used to implement the cross-reference tool described above. (PCCTS was renamed ANTLR because no one knew how to pronounce PCCTS, although someone suggested that the official pronunciation should be "Fred".) To help you read the grammar provided with this article, we need to describe the primary change when moving from PCCTS to ANTLR: a radical departure from conventional lexical analysis.
ANTLR generalizes the notion of scanning, parsing and tree parsing into the single abstraction of applying grammatical structure to an input stream, be it a stream of characters, tokens or tree nodes. A grammar file in ANTLR now consists of (grammar) class definitions using a single consistent notation rather than different notations for parsing and scanning. Language tools have traditionally used different notations for these seemingly different tasks: regular-expression notation to describe input symbols, and BNF (Backus-Naur Form) notation to describe language structure. In other words, programmers had to use the regular expression

```
[a-z]+
```

(one or more characters from a to z) to describe an identifier, and

```
declaration : type IDENTIFIER ASSIGN expression SEMICOLON ;
```

to describe the grammatical structure of a declaration in their language of interest. ANTLR uses the same extended-BNF notation to describe both, so a lexical grammar is written with the same rule notation used in a parser grammar.
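For instance, the identifier pattern above could be expressed as a lexical rule in the same EBNF rule notation a parser rule uses. This is a sketch in ANTLR 2's notation, and the rule name is our own:

```
IDENT : ('a'..'z')+ ;
```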
Along with this wonderful unification ("syntactic yummy"), ANTLR generates much stronger scanners than the finite-automata scanners produced by traditional lexer-generation tools. The increase in scanning power comes at the cost of some inconvenience in scanner specification, and indeed requires a shift in how you think about lexical analysis. Finite automata do automatic left-factoring of common (possibly infinite) left-prefixes; in a hand-built scanner, you have to find and factor out all common prefixes yourself. For example, consider writing a lexer to match integers and reals. The regular expressions are straightforward:

```
INTEGER : "[0-9]+" ;
REAL    : "[0-9]+{.[0-9]*}|.[0-9]+" ;
```

Building a scanner for this by hand, or using ANTLR's scanner mechanism, would require factoring out the common [0-9]+ prefix of the two rules. For a full description of ANTLR's lexical mechanism, please see the documentation and resources at the ANTLR Home Page.

## Conclusion

We conclude this series with this cross-reference tool. It's a real tool that demonstrates the power of parsing technology. Cross-referencing is a common task, and building the tool provides a glimpse into several issues related to parser development. A real tool is not just a simple grammar that recognizes input of a given source language; it needs other constructs that work in concert with the parser. We used the latest and greatest version of ANTLR to implement the tool, parsing Java source files using Java itself. We demonstrated how a symbol table can be connected with a parser to give the parser some memory and to resolve references between the definitions in that symbol table.

## Where to Go for More Information