O'Reilly Book Excerpts: Learning Java, 2nd Edition
XML Basics for Java Developers, Part 4
by Patrick Niemeyer and Jonathan Knudsen
In part four of a series of book excerpts on XML basics for Java developers from Learning Java, 2nd Edition, learn about validating documents.
"Words, words, mere words, no matter from the heart."
William Shakespeare, Troilus and Cressida
In this section, we talk about DTDs and XML Schema, two ways to enforce rules an XML document must follow. A DTD is a grammar for an XML document, defining which tags may appear where and in what order, with what attributes, etc. XML Schema is the next generation of DTD. With XML Schema, you can describe the data content of the document in terms of primitives such as numbers, dates, and simple regular expressions. The word schema means a blueprint or plan for structure, so we'll refer to DTDs and XML Schema collectively as schema where either applies.
Now for a reality check. Unfortunately, Java support for XML Schema isn't entirely mature at the time of this writing. XML support in Java 1.4.0 is based on the Apache Project's Crimson parser (which in turn is based on Sun's "Project X" parser). The Crimson engine doesn't support XML Schema. However, a future release of Java will migrate the XML implementation to the Apache Xerces2 engine, and at that time, XML Schema should begin to be supported.
Using Document Validation
XML's validation of documents is a key piece of what makes it useful as a data format. Using a schema is somewhat analogous to the way Java classes enforce type checking in the language. Schema define document types. Documents conforming to a given schema are often referred to as instance documents.
This type safety provides a layer of protection that eliminates having to write complex error-checking code. However, validation may not be necessary in every environment. For example, when the same tool generates XML and reads it back, validation should not be necessary in normal operation. It is invaluable, though, during development. Often, document validation is used during development and turned off in production environments.
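One way to switch validation on for development and off for production is to drive the parser factory's validating flag (described later in this section) from a configuration setting. The following is a minimal sketch of that idea; the class name ValidationToggle and the system property name zoo.validate are hypothetical and chosen only for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

class ValidationToggle {

    // Build a parser factory whose validating flag follows the caller's choice.
    static SAXParserFactory newFactory(boolean validate) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);
        return factory;
    }

    public static void main(String[] args) {
        // "zoo.validate" is a hypothetical property name; run with
        // -Dzoo.validate=true during development and leave it unset in production.
        boolean validate = Boolean.getBoolean("zoo.validate");
        SAXParserFactory factory = newFactory(validate);
        System.out.println("validating = " + factory.isValidating());
    }
}
```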
The Document Type Definition language is fairly simple. A DTD is primarily a set of special tags that define each element in the document and, for complex types, provide a list of the elements it may contain. The DTD <!ELEMENT> tag consists of the name of the tag and either a special keyword for the data type or a parenthesized list of elements.
<!ELEMENT Name ( #PCDATA )>
<!ELEMENT Document ( Head, Body )>
The special identifier #PCDATA indicates character data (a string). When a list is provided, the elements are expected to appear in that order. The list may contain sublists, and alternatives may be expressed using a vertical bar (|) as an OR operator. Special notation can also be used to indicate how many of each item may appear; a few examples of this notation are shown in Table 23-2.
| * | Zero or more occurrences |
| ? | Zero or one occurrence |
| + | One or more occurrences |
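To see how ordering, the OR operator, and the occurrence notation behave, the following sketch checks a few small documents against different content models. It is only an illustration under stated assumptions: the element names Doc, A, and B and the class ContentModelSketch are hypothetical, the DTD is embedded in the document as an internal subset so that no external file is needed, and the code relies on a validating parser and an error handler, both of which are discussed later in this section.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ContentModelSketch {

    // Returns true if <Doc> with the given children satisfies the content model.
    static boolean isValid(String model, String children) throws Exception {
        // Hypothetical document: the DTD lives in an internal subset.
        String xml =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE Doc [\n"
          + "  <!ELEMENT Doc " + model + ">\n"
          + "  <!ELEMENT A EMPTY>\n"
          + "  <!ELEMENT B EMPTY>\n"
          + "]>\n"
          + "<Doc>" + children + "</Doc>";

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);

        final boolean[] valid = { true };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                valid[0] = false; // validity errors are reported here
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return valid[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("( A, B )", "<A/><B/>"));  // true: required order
        System.out.println(isValid("( A, B )", "<B/><A/>"));  // false: wrong order
        System.out.println(isValid("( A | B )", "<B/>"));     // true: either one
        System.out.println(isValid("( A*, B? )", ""));        // true: both may be absent
        System.out.println(isValid("( A+ )", ""));            // false: at least one A
    }
}
```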
Attributes of an element are defined with the <!ATTLIST> tag. This tag enables the DTD to enforce rules about attributes. It accepts a list of identifiers and a default value:
<!ATTLIST Animal class (unknown | mammal | reptile) "unknown">
This ATTLIST says that the Animal element has a class attribute that can have one of three values: unknown, mammal, or reptile. The default is unknown.
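The effect of the default value can be observed from Java. The sketch below parses a tiny document with the DOM and reads the class attribute back. It makes some simplifying assumptions: the DTD is embedded as an internal subset, Animal is declared EMPTY rather than with its real content, and the class name AttributeDefaultSketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultSketch {

    // Parses a single <Animal> tag and returns the value of its class attribute.
    static String classOf(String animalTag) throws Exception {
        // Simplified, hypothetical DTD: Animal has no content here.
        String xml =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE Animal [\n"
          + "  <!ELEMENT Animal EMPTY>\n"
          + "  <!ATTLIST Animal class (unknown | mammal | reptile) \"unknown\">\n"
          + "]>\n"
          + animalTag;

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("class");
    }

    public static void main(String[] args) throws Exception {
        // No class attribute in the document: the DTD default is supplied.
        System.out.println(classOf("<Animal/>"));                    // unknown
        // An explicit value overrides the default.
        System.out.println(classOf("<Animal class=\"reptile\"/>")); // reptile
    }
}
```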
We won't cover everything you can do with DTDs here. But the following example will guarantee zooinventory.xml follows the format we've described. Place the following in a file called zooinventory.dtd (or grab this file from the CD-ROM or web site for the book):
<!ELEMENT Inventory ( Animal* )>
<!ELEMENT Animal (Name, Species, Habitat, (Food | FoodRecipe), Temperament)>
<!ATTLIST Animal class (unknown | mammal | reptile) "unknown">
<!ELEMENT Name ( #PCDATA )>
<!ELEMENT Species ( #PCDATA )>
<!ELEMENT Habitat ( #PCDATA )>
<!ELEMENT Food ( #PCDATA )>
<!ELEMENT FoodRecipe ( Name, Ingredient+ )>
<!ELEMENT Ingredient ( #PCDATA )>
<!ELEMENT Temperament ( #PCDATA )>
The DTD says that an Inventory consists of any number of Animal elements. An Animal has a Name, Species, and Habitat tag followed by either a Food or FoodRecipe, and then a Temperament. FoodRecipe's structure is further defined later.
To use our DTD, we must associate it with the XML document. We do this by placing a DOCTYPE declaration in the XML itself. When a validating parser encounters the DOCTYPE, it attempts to load the DTD and validate the document. There are several forms the DOCTYPE can have, but the one we'll use is:
<!DOCTYPE Inventory SYSTEM "zooinventory.dtd">
Both SAX and DOM parsers can automatically validate documents that contain a DOCTYPE declaration. However, you have to explicitly ask the parser factory to provide a parser that is capable of validation. To do this, set the validating property of the parser factory to true before you ask it for an instance of the parser. For example:
SAXParserFactory factory = SAXParserFactory.newInstance( );
factory.setValidating( true );
Try inserting the setValidating( ) line in our model builder example at the location indicated above. Now abuse the zooinventory.xml file by adding or removing an element or attribute and see what happens when you run the example.
To really use the validation, we would have to register an org.xml.sax.ErrorHandler object with the parser, but by default Java installs one that simply prints the errors for us.
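As a rough sketch of what handling the errors yourself might look like, the following program writes zooinventory.dtd and zooinventory.xml into a temporary directory, parses the document with a validating SAX parser, and collects the validity errors in a list instead of printing them. The class name ValidateInventory, the sample animal data, and the use of a temporary directory are assumptions made so that the example is self-contained; DefaultHandler is used here because it implements ErrorHandler.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidateInventory {

    // The DTD from the text.
    static final String DTD = String.join("\n",
        "<!ELEMENT Inventory ( Animal* )>",
        "<!ELEMENT Animal (Name, Species, Habitat, (Food | FoodRecipe), Temperament)>",
        "<!ATTLIST Animal class (unknown | mammal | reptile) \"unknown\">",
        "<!ELEMENT Name ( #PCDATA )>",
        "<!ELEMENT Species ( #PCDATA )>",
        "<!ELEMENT Habitat ( #PCDATA )>",
        "<!ELEMENT Food ( #PCDATA )>",
        "<!ELEMENT FoodRecipe ( Name, Ingredient+ )>",
        "<!ELEMENT Ingredient ( #PCDATA )>",
        "<!ELEMENT Temperament ( #PCDATA )>");

    // Hypothetical sample data: one complete animal, one missing Temperament.
    static final String GOOD = "<Name>Song Fang</Name><Species>Giant Panda</Species>"
        + "<Habitat>China</Habitat><Food>Bamboo</Food><Temperament>Friendly</Temperament>";
    static final String BAD = "<Name>Song Fang</Name><Species>Giant Panda</Species>"
        + "<Habitat>China</Habitat><Food>Bamboo</Food>";

    // Wraps the animal content in a document that points at the DTD file.
    static String inventory(String animalBody) {
        return "<?xml version=\"1.0\"?>\n"
             + "<!DOCTYPE Inventory SYSTEM \"zooinventory.dtd\">\n"
             + "<Inventory><Animal class=\"mammal\">" + animalBody + "</Animal></Inventory>\n";
    }

    // Parses the document with validation on and returns the validity errors.
    static List<String> validate(String xml) throws Exception {
        // Put the DTD next to the XML so the relative SYSTEM identifier resolves.
        Path dir = Files.createTempDirectory("zoo");
        Files.write(dir.resolve("zooinventory.dtd"), DTD.getBytes(StandardCharsets.UTF_8));
        Path doc = dir.resolve("zooinventory.xml");
        Files.write(doc, xml.getBytes(StandardCharsets.UTF_8));

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        SAXParser parser = factory.newSAXParser();

        final List<String> problems = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity errors arrive here; parsing continues afterward.
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        };
        // Documents that are not well-formed still raise an exception
        // through the inherited fatalError( ) method.
        parser.parse(doc.toFile(), handler);
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid document:  " + validate(inventory(GOOD)));
        System.out.println("abused document: " + validate(inventory(BAD)));
    }
}
```

Collecting the messages rather than printing them lets the calling code decide whether to reject the document, log the problems, or carry on.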
Although DTDs can define the basic structure of an XML document, they can't adequately describe data and validate it programmatically. The evolving XML Schema standard is the next logical step and should replace DTDs in the near future. For more information about XML Schema, see http://www.w3.org/XML/Schema. As mentioned earlier, we expect an upcoming Java release to support XML Schema.
JAXB and Code Generation
The ultimate goal of XML will be reached by automated binding of XML to Java classes. There are several tools today that provide this, but they are hampered by the slow adoption of XML Schema.
The standard Java solution is the forthcoming Java XML Binding (JAXB) project. Unfortunately, at the time of this writing, JAXB is not mature. It is difficult to use and doesn't support XML Schema (necessary to fully describe document content). JAXB also requires its own "binding" language to be used, even for simple cases. We hope that the final release of JAXB will provide a good solution for XML binding. You can find information about JAXB at http://java.sun.com/xml/jaxb.
Unlike JAXB, Castor, an open source XML binding framework for Java, works with XML Schema and is relatively easy to use. Unfortunately, at the time of this writing, Castor doesn't support DTDs, and most industry- or task-specific XML standards are still written in terms of DTDs. You can find out more about Castor at http://www.castor.org/.
In the next installment, we conclude this book excerpt series with an introduction to XSL/XSLT and Web services.