Article
Parsers, Part III: A Parser For The Java Language
  
 
 

Articles Index


Java Parser Series Introduction
Part II
Introduction to Part III

In this article, we introduce a parser, or recognizer for the Java language and describe a simple Java source-code browser that uses the parser to tag class, method, and variable declarations. While the code browser is simplistic, it demonstrates how a Java grammar, augmented with a few simple print statements, can provide extremely useful information.

This article includes a Java grammar, the ANTLR parser generator (part of PCCTS, the Purdue Compiler-Construction Tool Set) used to generate a Java parser from the grammar, the source code for the browser, and associated executables for Win32. In the next article, we will dive into the principles behind language recognition and describe the ANTLR parser generator in more detail. For a gentler introduction to parsers and a glossary terms, see Does a Yak Have Antlers? (Parsing for Fun and Profit)

The Java grammar is written for the 1.33 version of the ANTLR parser generator, which generates parsers implemented in C++. To be precise, the parser invoked by the simple browser we have built is compiled with a C++ compiler, and executed via the Java Runtime.exec method call. Implementing the parser in C++ has the advantages that C++ is currently faster than Java, many tool developers' environments are written in C++; and it is available now. The 2.0 version of ANTLR will be available soon. It is being rewritten in Java, and will generate parsers in either Java or C++, and can be extended for other languages as well. Having a Java parser written in Java should increase the number of applications of ANTLR substantially. We look forward to showing you how to integrate the Java version of the parser into various Java applets and applications when the 2.0 version of ANTLR becomes available.

This article proceeds in two sections, first there is a discussion of the grammar file (java.g). This includes a comparison with the Java grammar presented in Chapter 19 of The Java Language Specification, as well as some discussion of the ambiguities that arise when parsing Java and how ANTLR solves those problems. The second section gives an overview of the code browser, and shows how an ANTLR-generated parser can be integrated with a Java application.

Java.g, A Java Grammar for ANTLR

This grammar was originally based on the source code of the Java parser distributed by Sun, and has been updated to more closely coincide with the grammar in the Java Language Specification (JLS). The grammar reflects the 1.0.2 release of the Java language and does not handle JDK 1.1 beta features such as inner classes. The resulting Java parser correctly parses all source code in the JDK 1.0.2 release (and JDK 1.1 minus the inner class declarations).

While the grammar in the JLS is for LALR(1) parser generators, such as yacc, this grammar is an LL(k) grammar that takes advantage of the unique features of ANTLR, an LL(k) parser generator. Although the grammar presented here is very similar to the JLS grammar in overall structure and in many individual rules, there are numerous differences.

When programmers construct parsers or recognizers by hand, they invariably build recursive-descent parsers, which fall into the formal category of LL(k) parsers. For example, the parsers generated by ANTLR are LL(k), which means that the parser parses from left-to-right (as opposed to end-of-file to the beginning), makes decisions on the left-edge of rules, and uses k symbols of lookahead. LR-based parsers such as yacc, which is LALR(1), parse from left-to-right, make decisions from the right edge (after having matched entire alternatives), and use only a single symbol of lookahead. LL(k) parsing requires more lookahead to overcome the reduced context information (seeing only left edge before making a decision).

Recursive-descent LL(1) parsers are simpler, much more flexible, and much easier to understand and debug than LALR(1) parsers. For example, a source-level debugger can be used to step through a human-readable, recursive-descent parser. On the other hand, LALR(1) is often stronger, thus, allowing the programmer to write more natural grammars (rather than having to do lots of grammar left-factoring as with LL(1) parsers). To counter this inadequacy, ANTLR generates predicated-LL(k) parser where k is greater than 1 (k>1). ANTLR-generated parsers use k>1 fixed lookahead, arbitrary lookahead, and semantic predicates to easily outpace LALR(1) parsers in recognition strength. At the same time ANTLR-generated LL(k) parsers retain all the advantages of recursive-descent parsers.

As you will see in the Java grammar, LL(k) parsers have lots of context information (that is, "What have I parsed so far?") whereas LALR(1) parsers do not because they must delay acceptance of a rule until they have seen the entire input matched for that rule. For example, when parsing a method, an action in an LL(k) grammar can access the name of the enclosing class being parsed.

No Semantic Checking or Symbol Table Generation

One important thing to note is that while parsers generated with this grammar should parse all legal Java programs, it will also happily parse many illegal programs that have semantic errors. This is because the grammar file does not contain actions to construct symbolic information about the input Java program. Such information is necessary to catch many errors. For example, the following invalid Java fragment would be accepted by the Java parser:

int i;
i b;  // invalid because i is an int not a typename

The Java parser cannot tell that i is not a type. The construct is, therefore, syntactically acceptable, but semantically invalid. Our intention in this first article is simply to present the grammar along with a practical application of it. Future articles will provide actions to build a symbol table.

Starting from the Top

The rules in this grammar are presented as in the JLS grammar, in a roughly coarsest-to-finest-grain fashion, which is also the way they are executed during a parse, and are grouped by language areas. First there are the top level constructs such as package identification, import statements, and class and interface declarations. Then follow rules for field definitions that appear in the bodies of classes, and finally two large sections that cover statements and expressions.

The top-level constructs in the grammar are relatively straightforward: there is little recursion among rules, and few potential parsing problems. However, when distinguishing between methods, member variables, and constructors, the natural grammar is non-LL(k) for any fixed k:

field
    :   constructorDefinition
    |   methodDefinition
    |   variableDefinitions
    |   "static" compoundStatement
    |   ";"
    ;

Any of the first three alternatives can begin with an arbitrarily long sequence of modifiers such as public static and so on. Rather than do lots of grammar rewriting to factor out the common left-prefixes, we elected, in favor of readability, to use syntactic predicates to obtain a valid parser:

field
  : ( (modifier)* methodHead "\{" )?
    constructorDefinition

  | ( (modifier)* typeSpec methodHead ( "\{" | ";" ) )?
      methodDefinition

  |   variableDefinitions
  |   "static" compoundStatement  
                      // "static { ... }" initializer
  |   ";"

    ;

A syntactic predicate specifies the (possibly infinite) lookahead language that uniquely predicts the associated alternative. For example, if the input is consistent with an optional series of modifiers, a method head, and the start of a code block, the parser will try to match a constructor--the first alternative. If the input is inconsistent with that lookahead, the token stream is rewound and the next alternative is attempted.

Statements

The statement grammar of Java is similar to C and C++. Anyone familiar with parsing Algol-style languages will recognize the structure of this grammar. Basically, a statement can be any one of a set of basic statements, such as a for loop or an if statement, which themselves can contain more statements. So the grammar is highly recursive.

The grammar presented here is noticeably different from that in the JLS. Much of this difference is due to ANTLR's ability to handle various difficult language constructs in a direct fashion or by using syntactic predicates. An example of this is the infamous language ambiguity related to if-then-else statements, wherein the issue is correctly matching an else clause with the correct if. This normally results in a grammar ambiguity message from ANTLR, but the resulting parser behaves naturally as it elects to match the else immediately. We have used the approx pragma to turn off the ambiguity warning from ANTLR. (Please note that the pragma signals ANTLR to avoid trying to compute a full lookahead and instead use a special approximate lookahead algorithm--the squelching of the warning message is a side effect). Handling these cases in an LALR(1) grammar typically requires significant refactoring of related rules, which can be seen in the JLS grammar, in rules such as StatementWithoutTrailingSubstatement.

Expressions

As with statements, the expression grammar of Java is based on C/C++. The structuring of the rules in this section is done so that precedence of operators is correctly observed. Note one parsing difficulty: A cast expression can be confused with a parenthesized expression. This problem is mentioned also in the JLS, and is resolved by using some rule rewriting. The ANTLR grammar uses a syntactic predicate to match a potential type cast. The parser looks past the parenthesized expression to see if an expression follows; if so, the parenthesized expression is a type cast.

An Overview of the Browser Source Code

The Browser class (in Browser.java) is the top-most class for the browser. It implements a main method and handles command line arguments. If a .java file is given as an argument, it runs the parser executable, javap.exe, on the file (using the Runtime.exec method), and builds a TagTree based on the output. It then creates a TagBrowserFrame object, which is the top level window for the code browser. If no file is specified on the command line, it creates a TagBrowserFrame with no default file; however, a default file can be opened using the menu.

The BrowserFrame class is the superclass of TagBrowserFrame and implements several general features of the browser. It creates a menu bar with Open and Quit menu items, implements the event code to capture menu events, and provides two placeholder methods to handle the menu actions. The BrowserFrame class also initializes the layout of the window with a large TextArea for displaying the source code and a List that shows a list of the primary items defined in the .java file, such as classes and methods.

TagBrowserFrame handles many of the details of the browser's behavior. As a subclass of BrowserFrame, it provides implementations for the Open and Quit menu items. The Open item will bring up a FileDialog so the user can choose a new file to browse. The Quit item causes the browser application to exit. The displayFile method reads in the source file and puts it in the TextArea used to display code. When the user clicks on an item in the tag list, the TagBrowserFrame handles the event by scrolling the code TextArea to the line in the source file where the tagged item is defined.

The TagTree class and its subclass, TagTreeRoot, are used to structure the tag information returned by the parser. The parser outputs a line of information for each primary definition in the .java file. This includes class and method definitions, as well as data fields in classes and local variables in methods. The TagTree method readInTags reads in this information and creates a tree node for each line item. There are several typical tree creation methods in TagTree, such as addChild to add a new child to the tree node, and getChildren to get the tree of children. Each TagTree object has an id field that holds the name of the item it refers to in the .java source file, and the line number where that item is defined.

The TagTreeRoot class is a subclass of TagTree that handles the special case of the root node of the tree. The root node has no item to point to in the .java source file, so the id is automatically set to "<root>" in its constructor. It also has a lineCache array used to cache the line numbers of all of its children. This cache is used to rapidly determine which line to scroll the TextArea to when a user double-clicks in the tag list.

Conclusion

Though the browser, as presented, is relatively simplistic, it demonstrates how a C++ parser generated with ANTLR 1.33 can be easily integrated with a Java application. The browser classes can be quickly extended to add more functionality, such as nesting the items in the list box, or capturing parser errors and highlighting the offending code in the TextArea. In a future article we'll describe the browser in more detail and show you how to collect symbol information during a parse, as well as how to build an abstract syntax tree.

Overall, using ANTLR to generate parsers provides several advantages over using LALR(1)-based tools. The most important one is flexibility. The top-down approach of LL(k) that ANTLR uses allows for more communication between rules. Since the parser always "knows" where it is, from the top rule on down, more useful information is available to actions in each rule.

ANTLR's lookahead capabilities (k>1 and syntactic predicates), minimizes the problems traditionally associated with LL parsing, and ultimately gives ANLTR-generated parsers more powerful parsing capabilities than LALR(1) parsers. As shown above with the Java grammar, difficult grammar constructs can be handled easily, resulting in a more readable and maintainable grammar.

Source Code and Executables

The grammar and browser files described in this article are available individually, or zipped up. Please note that we have turned off error messages in the parser (by defining method ANTLRParser::syn() to do nothing) for use with the browser, as syntax errors caused the parser not to return when called from Java.

Win32 users can unzip Browserw32.zip and can start up the browser via java Browser; choose open from the File menu to begin browsing. On a non-Win32 platform, the parser executable will have to be built before being able to open and browse files. First build PCCTS in the Browser/pccts subdirectory by running make from the command line then build the parser from java.g in the Browser/parser subdirectory by typing make again (you may have to set your C++ compiler etc... in the makefiles). See Getting Started With PCCTS for more details.

References

The Java Language Specification, James Gosling, Bill Joy, and Guy Steele. Addison-Wesley, 1996.

Language Translation Using PCCTS & C++, Terence John Parr. Automata Publishing Company, 1996.

Getting Started With PCCTS, Terence Parr, MageLang Institute.


copyright © Sun Microsystems, Inc