![]() |
![]() |
![]() ![]() |
![]() |
![]() |
In this article, we introduce a parser, or recognizer for the Java language and describe a simple Java source-code browser that uses the parser to tag class, method, and variable declarations. While the code browser is simplistic, it demonstrates how a Java grammar, augmented with a few simple print statements, can provide extremely useful information. This article includes a Java grammar, the ANTLR parser generator (part of PCCTS, the Purdue Compiler-Construction Tool Set) used to generate a Java parser from the grammar, the source code for the browser, and associated executables for Win32. In the next article, we will dive into the principles behind language recognition and describe the ANTLR parser generator in more detail. For a gentler introduction to parsers and a glossary terms, see Does a Yak Have Antlers? (Parsing for Fun and Profit)
The Java grammar is written for the 1.33 version of the ANTLR parser
generator, which generates parsers implemented in C++. To be precise,
the parser invoked by the simple browser we have built is compiled
with a C++ compiler, and executed via the Java
This article proceeds in two sections, first there is a discussion of
the grammar file ( Java.g, A Java Grammar for ANTLR This grammar was originally based on the source code of the Java parser distributed by Sun, and has been updated to more closely coincide with the grammar in the Java Language Specification (JLS). The grammar reflects the 1.0.2 release of the Java language and does not handle JDK 1.1 beta features such as inner classes. The resulting Java parser correctly parses all source code in the JDK 1.0.2 release (and JDK 1.1 minus the inner class declarations). While the grammar in the JLS is for LALR(1) parser generators, such as
When programmers construct parsers or recognizers by hand, they invariably build recursive-descent parsers, which fall into the formal category of LL(k) parsers. For example, the parsers generated by ANTLR are LL(k), which means that the parser parses from left-to-right (as opposed to end-of-file to the beginning), makes decisions on the left-edge of rules, and uses k symbols of lookahead. LR-based parsers such as yacc, which is LALR(1), parse from left-to-right, make decisions from the right edge (after having matched entire alternatives), and use only a single symbol of lookahead. LL(k) parsing requires more lookahead to overcome the reduced context information (seeing only left edge before making a decision). Recursive-descent LL(1) parsers are simpler, much more flexible, and much easier to understand and debug than LALR(1) parsers. For example, a source-level debugger can be used to step through a human-readable, recursive-descent parser. On the other hand, LALR(1) is often stronger, thus, allowing the programmer to write more natural grammars (rather than having to do lots of grammar left-factoring as with LL(1) parsers). To counter this inadequacy, ANTLR generates predicated-LL(k) parser where k is greater than 1 (k>1). ANTLR-generated parsers use k>1 fixed lookahead, arbitrary lookahead, and semantic predicates to easily outpace LALR(1) parsers in recognition strength. At the same time ANTLR-generated LL(k) parsers retain all the advantages of recursive-descent parsers. As you will see in the Java grammar, LL(k) parsers have lots of context information (that is, "What have I parsed so far?") whereas LALR(1) parsers do not because they must delay acceptance of a rule until they have seen the entire input matched for that rule. For example, when parsing a method, an action in an LL(k) grammar can access the name of the enclosing class being parsed. No Semantic Checking or Symbol Table Generation One important thing to note is that while parsers generated with this grammar should parse all legal Java programs, it will also happily parse many illegal programs that have semantic errors. This is because the grammar file does not contain actions to construct symbolic information about the input Java program. Such information is necessary to catch many errors. For example, the following invalid Java fragment would be accepted by the Java parser: int i; i b; // invalid because i is an int not a typename
The Java parser cannot tell that Starting from the Top The rules in this grammar are presented as in the JLS grammar, in a roughly coarsest-to-finest-grain fashion, which is also the way they are executed during a parse, and are grouped by language areas. First there are the top level constructs such as package identification, import statements, and class and interface declarations. Then follow rules for field definitions that appear in the bodies of classes, and finally two large sections that cover statements and expressions. The top-level constructs in the grammar are relatively straightforward: there is little recursion among rules, and few potential parsing problems. However, when distinguishing between methods, member variables, and constructors, the natural grammar is non-LL(k) for any fixed k:
Any of the first three alternatives can begin with an arbitrarily long
sequence of modifiers such as
A syntactic predicate specifies the (possibly infinite) lookahead language that uniquely predicts the associated alternative. For example, if the input is consistent with an optional series of modifiers, a method head, and the start of a code block, the parser will try to match a constructor--the first alternative. If the input is inconsistent with that lookahead, the token stream is rewound and the next alternative is attempted. Statements
The statement grammar of Java is similar to C and C++. Anyone familiar with
parsing Algol-style languages will recognize the structure of this grammar.
Basically, a statement can be any one of a set of basic statements, such as a
The grammar presented here is noticeably different from that in
the JLS. Much of this difference is due to ANTLR's ability to handle
various difficult language constructs in a direct fashion or by using
syntactic predicates. An example of this is the infamous language
ambiguity related to Expressions As with statements, the expression grammar of Java is based on C/C++. The structuring of the rules in this section is done so that precedence of operators is correctly observed. Note one parsing difficulty: A cast expression can be confused with a parenthesized expression. This problem is mentioned also in the JLS, and is resolved by using some rule rewriting. The ANTLR grammar uses a syntactic predicate to match a potential type cast. The parser looks past the parenthesized expression to see if an expression follows; if so, the parenthesized expression is a type cast. An Overview of the Browser Source Code
The
The
The
The Conclusion Though the browser, as presented, is relatively simplistic, it demonstrates how a C++ parser generated with ANTLR 1.33 can be easily integrated with a Java application. The browser classes can be quickly extended to add more functionality, such as nesting the items in the list box, or capturing parser errors and highlighting the offending code in the TextArea. In a future article we'll describe the browser in more detail and show you how to collect symbol information during a parse, as well as how to build an abstract syntax tree. Overall, using ANTLR to generate parsers provides several advantages over using LALR(1)-based tools. The most important one is flexibility. The top-down approach of LL(k) that ANTLR uses allows for more communication between rules. Since the parser always "knows" where it is, from the top rule on down, more useful information is available to actions in each rule. ANTLR's lookahead capabilities (k>1 and syntactic predicates), minimizes the problems traditionally associated with LL parsing, and ultimately gives ANLTR-generated parsers more powerful parsing capabilities than LALR(1) parsers. As shown above with the Java grammar, difficult grammar constructs can be handled easily, resulting in a more readable and maintainable grammar. Source Code and Executables
The grammar and browser files described in this article are available
individually, or zipped up. Please note that we have turned off error messages
in the parser (by defining method Win32 users can unzip
References The Java Language Specification, James Gosling, Bill Joy, and Guy Steele. Addison-Wesley, 1996. Language Translation Using PCCTS & C++, Terence John Parr. Automata Publishing Company, 1996. Getting Started With PCCTS, Terence Parr, MageLang Institute. |
![]() |
![]() |
![]() |