Dave Cater

Java - XML parsing using SAX

 
One of my early Java interests was discovering ways to process XML files.

I soon realised that, as well as the low-level XML APIs with Java bindings (such as DOM and SAX), there were a number of other approaches to processing XML (such as JDOM). In general terms, these did more of the work of creating Java classes to represent XML structures.

I decided to concentrate first on the low-level APIs, DOM and SAX. I wanted to find the minimum steps needed to set up a working environment for processing XML files with each of them.

This page describes my work on SAX; there is a separate DOM page.

An article on Mapping XML to Java by Robert Hustead got me started. These are the steps which followed:

  • I made sure PATH was set to include the Java utilities from the Sun JDK I had downloaded:
        PATH=/usr/local/jdk1.2.2/bin:$PATH
        export PATH
    Putting the JDK directory first means its utilities are found before the old versions supplied with my Linux installation.

  • I followed the links to David Megginson's SAX pages, from where I downloaded the SAX 2.0 Java package in a file called sax2.zip. I unpacked this using unzip; the file can also be unpacked on Windows using WinZip.

  • As recommended, I downloaded the Xerces Java parser from the Apache Software Foundation XML project site. I chose version 1 of the Java parser (1.3.1, to be precise), as version 2 still seemed to be in development. The documentation (see below) confirmed that it supported the SAX 2.0 API. I then ran into some problems unpacking the downloaded file. Although the link implies that the file extension is ".tar.gz", the file actually downloaded was Xerces-J-bin_1_3_1_tar.tar. Despite the name, the archive was still compressed, so the tar -z flag had to be specified to filter it through gzip. Using tar any other way (or attempting to uncompress the file first using gunzip) was of no use. Luckily, with that flag my version of GNU tar worked fine:
        tar -xvzf Xerces-J-bin_1_3_1_tar.tar

  • To read the documentation for Xerces 1.3.1 I used the following commands:
        cd xerces-1_3_1/docs/html
        netscape `pwd`/index.html

    To read the documentation for SAX 2.0 I used the following commands:

        cd sax2/docs
        netscape `pwd`/sax2.html

    In both cases a link then takes you straight to the API documentation.

  • I then followed the Quick Start guide in the SAX documentation. In order to compile the sample code, it's necessary to set CLASSPATH in the environment to include both the SAX and Xerces JAR files, in my case (the value must be a single line with no spaces):
        CLASSPATH=/mnt/DOS_hda1/Linux/sax2/sax2.jar:/mnt/DOS_hda1/Linux/xerces1.3.1/xerces-1_3_1/xerces.jar
        export CLASSPATH
    This can also be done by using the -classpath argument to the Java compiler (again, the two JAR paths form a single colon-separated argument):
        javac -classpath /mnt/DOS_hda1/Linux/sax2/sax2.jar:/mnt/DOS_hda1/Linux/xerces1.3.1/xerces-1_3_1/xerces.jar \
            MySAXApp.java
    This should create the Java class file MySAXApp.class.

  • When you run the compiled Java class in the Java interpreter, you will get a Java runtime error message saying
        Exception in thread "main" java.lang.NoClassDefFoundError: MySAXApp
    unless you also add "." to your classpath, either in CLASSPATH or on the command line. Once a classpath has been set explicitly, Java no longer searches the current directory by default, so "." is needed for it to find the class file compiled in the previous step.

  • As mentioned in the SAX documentation, you need to specify a Java property on the command line to identify the SAX2 driver class provided by the Apache Xerces parser. I found the class name by hunting through the API documentation:
        java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \
            MySAXApp

  • Now I had a working XML parser, and sample code to read XML files and pass them to the parser. A few simple tests proved that incorrect XML files yielded sensible error messages:
        java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \
            MySAXApp file1.xml file2.xml [...]

  • Finally, I wanted to handle exceptions properly, as I was not entirely happy with declaring that main "throws Exception". My first attempt put one try block around the setup code and another around the file reading code. Surprise, surprise, Java said I had inadvertently declared a variable in the first try block and attempted to use it in the second, where it was out of scope. At last, a block-structured language that stops you writing spaghetti. I got round this by having one high-level block to catch SAX exceptions, and inside it a block around the file reading code just to catch IOException and FileNotFoundException. So no more spaghetti from me.

    In making this change I discovered that, since FileNotFoundException extends IOException, you have to catch FileNotFoundException first. If the IOException handler comes first, it already covers the subclass, so the compiler rejects the second catch block as unreachable.
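
The exception handling described in the last step can be sketched roughly as follows. This is a minimal illustration rather than the MySAXApp code from the SAX Quick Start: the class name SaxTryCatchSketch and the method parseAll are made up for the example, it obtains the parser through the JAXP SAXParserFactory bundled with current JDKs (an assumption, instead of the Xerces 1.3.1 driver property used above), and it relies on try-with-resources and multi-catch, which need a much newer Java than JDK 1.2.2.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this sketch.
class SaxTryCatchSketch {

    /** Parses each named file; returns how many were read and parsed cleanly. */
    static int parseAll(String[] fileNames) {
        int parsedOk = 0;
        try {
            // Setup code. The reader is declared inside the outer try block,
            // so it is still in scope for the file-reading loop below.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());

            for (String name : fileNames) {
                // Inner block: deals only with problems reading the file.
                try (Reader in = new FileReader(name)) {
                    reader.parse(new InputSource(in));
                    parsedOk++;
                } catch (FileNotFoundException e) {
                    // The subclass must be caught before IOException;
                    // swapping the two catch blocks is a compile-time error.
                    System.err.println("No such file: " + name);
                } catch (IOException e) {
                    System.err.println("I/O error on " + name + ": " + e.getMessage());
                }
            }
        } catch (SAXException | ParserConfigurationException e) {
            // Outer block: parser setup failures and XML errors end up here,
            // which also stops the processing of any remaining files.
            System.err.println("SAX error: " + e.getMessage());
        }
        return parsedOk;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(args) + " of " + args.length + " file(s) parsed");
    }
}
```

With this layout a missing file is reported and the loop carries on with the next one, whereas a file that is not well-formed XML raises a SAXException that is caught by the outer block.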