Java Parser Series Introduction
  
 
 


Java was specifically designed to simplify the complexities of C++ syntax and semantics. This simplification benefits not only the programmers who write Java programs, but also the toolsmiths who build Java development tools. C++ is generally considered to be the most difficult language for which to write a parser (also known as a recognizer). Just building a correct symbol table manager for C++ can take a language expert a month of programming effort, whereas a language expert with the right tools can build an entire Java parser in a few days.

While a Java parser could be built by hand, tools called parser generators exist that can write recognizers automatically, given a description of a language's grammatical structure. The ANTLR parser generator is one such tool that has become popular due to its power, simplicity, and flexibility.
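To make "built by hand" concrete, here is a minimal sketch of a hand-written recognizer. It assumes a toy two-rule grammar invented for illustration (it is not Java's grammar, and it is not what ANTLR generates); the class name `ToyRecognizer` is likewise hypothetical. Each grammar rule becomes one method, which is roughly the bookkeeping a parser generator automates from the grammar description.

```java
// Hand-written recognizer for a toy grammar (illustrative assumption only):
//   expr : term ('+' term)* ;
//   term : DIGIT+ | '(' expr ')' ;
// Whitespace is not handled; any character outside the grammar is rejected.
public class ToyRecognizer {
    private final String input;
    private int pos = 0;

    private ToyRecognizer(String input) {
        this.input = input;
    }

    /** Returns true if the whole input matches the 'expr' rule. */
    public static boolean recognize(String input) {
        ToyRecognizer r = new ToyRecognizer(input);
        return r.expr() && r.pos == input.length();
    }

    // expr : term ('+' term)*
    private boolean expr() {
        if (!term()) return false;
        while (peek() == '+') {
            pos++;                      // consume '+'
            if (!term()) return false;
        }
        return true;
    }

    // term : DIGIT+ | '(' expr ')'
    private boolean term() {
        if (peek() == '(') {
            pos++;                      // consume '('
            if (!expr() || peek() != ')') return false;
            pos++;                      // consume ')'
            return true;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        return pos > start;             // at least one digit required
    }

    // Current character, or '\0' at end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(recognize("1+(2+3)"));
        System.out.println(recognize("1+"));
    }
}
```

Even for two rules, the lookahead and consume logic has to be written and kept in sync with the grammar by hand; for a language the size of Java, that maintenance cost is the main argument for generating the recognizer instead.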

This article introduces a four-part series on parsing Java source files as it applies to development tool construction, including a discussion of ANTLR and general language recognition principles. Each article in the series focuses on one of the following four subjects:

  • An introduction to the Java parser and its application. A working source code browser application is presented.
  • General principles behind language recognition and translation and how to use the ANTLR parser generator. Source code and binary executables are provided for ANTLR.
  • Symbol table management for Java and how symbolic information can be used to answer questions about Java source code.
  • A discussion of the latest version of ANTLR that generates Java rather than C++. Source and binary executables are provided.

Many programmers dismiss language recognition as purely a compiler writer's problem; however, a number of interesting noncompiler tools can be built using a parser as a base. For example, a Java parser lies at the heart of the following useful tools:

  • JDK 1.0.2 to 1.1 code migration tool. A translator can be built to detect and change obsolete, discouraged, or renamed method names.
  • Java source code browser. Many companies are building code browsers and debuggers for Java. Being able to examine the source and access symbol table information is crucial.
  • Java source code obfuscator ("munger"). The portability of Java .class files currently comes at a price: byte-code decompilers can reverse the compilation process and recover essentially the exact Java source code (including variable names) for any compiled Java program. A code obfuscator could simply rename all of a program's classes, variables, and methods to a1, a2, a3, and so on, effectively rendering any decompiled output unreadable. A munger can therefore help protect your intellectual property by causing meaningless output to be generated when someone tries to decompile your class files with tools like Mocha.
  • In-house Java extensions to aid debugging. A Java translator could be built that accepts debugging extensions (such as "run method x after each access to this object to ensure consistency") or that automatically adds extra debugging information to your program.
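The renaming step of the obfuscator described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `NameMunger` and its method are hypothetical, and the list of declared names is assumed to have been collected already by a parser and symbol table, which is the hard part that this series covers. A real munger would also have to leave alone any names that outside code depends on, such as library classes and public entry points.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assigns meaningless replacement names to program symbols.
public class NameMunger {

    /** Maps each distinct name to a1, a2, a3, ... in order of first appearance. */
    public static Map<String, String> buildRenameMap(List<String> declaredNames) {
        Map<String, String> renames = new LinkedHashMap<String, String>();
        for (String name : declaredNames) {
            if (!renames.containsKey(name)) {
                // The next index is one more than the number of names mapped so far.
                renames.put(name, "a" + (renames.size() + 1));
            }
        }
        return renames;
    }

    public static void main(String[] args) {
        // Illustrative names, as a symbol table might report them (assumed input).
        List<String> names = Arrays.asList("Account", "balance", "deposit", "balance");
        System.out.println(buildRenameMap(names));
    }
}
```

Applying the map consistently at every declaration and every reference is what requires symbol table information: two variables that share a name in different scopes are distinct symbols, so a purely textual search and replace would not be safe.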

For more information on the ANTLR parser generator on which this series is based, see the "getting started in ANTLR" page.


copyright © Sun Microsystems, Inc