Lightweight XML Editing in Word 2003
Word 2003 also quietly introduces a new feature called formatting restrictions. When you have formatting restrictions enabled, users are restricted to using the set of styles that you specify. They can't modify the styles, nor can they apply direct formatting (such as bold or italic) to their document. While not specifically an XML feature, this enables a sort of document validation that makes particular sense when you are using the lightweight XML editing approach described above. It lets you restrict the range of formatting constructs that your conversion XSLT will have to handle. Rather than writing a generic WordprocessingML transformation, your style sheet will have to handle only those Word documents that are restricted to a particular Word template and its styles. This is a global restriction--a set of allowed styles, as opposed to a content model schema. You can't, for example, enforce that the Emphasis character style be used only in Normal paragraphs. Nevertheless, it is a profoundly useful feature for XML editing applications in Word.
If you look back in article2wordml.xsl, you'll see that formatting restrictions are enabled as a document setting:
<w:docPr>
  ...
  <w:documentProtection w:formatting="on" w:enforcement="on"/>
</w:docPr>
The particular styles that are locked or unlocked are indicated as such in the WordprocessingML's global <w:styles> element. In this case, we restrict users to only the styles they see in the "Styles and Formatting" task pane. The other built-in styles normally available to Word users appear as if they don't even exist anymore.
Using Word's XSLT Processor
This editing "solution" will work regardless of the edition of Word 2003 you have, provided that you have an external XSLT processor to do the transformations between edits. But if you have Office Professional or the stand-alone Word 2003, then you don't need another XSLT processor; you can use the bundled XSLT processor that comes with those editions of Word. Looking back at our article XML example, we see two processing instructions:
<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>
The mso-application processing instruction (PI) associates the XML file with the Word application, so that when a user double-clicks the file, Word opens the XML file, overriding whatever the default XML viewer is on their system. The second PI is useful only if you've got the advanced XML features. Upon opening the file, the user is presented with an option to apply article2wordml.xsl to the document, yielding the editing view we saw above. This is called an onload transformation.
Our other style sheet, wordml2article.xsl, is called an onsave style sheet, as it is applied to the WordprocessingML representation of the edited Word document when the user saves the document after making changes. How does Word know to use this style sheet, you ask? It is referenced inside the WordprocessingML result of the onload transformation. If you look inside article2wordml.xsl, you'll see the relevant document properties being set like so:
<w:docPr>
  <!-- This only works if you're using Word 2003 standalone
       or Office 2003 Professional -->
  <w:useXSLTWhenSaving/>
  <w:saveThroughXSLT w:xslt="wordml2article.xsl"/>
  ...
</w:docPr>
The end result is that end users can open, edit, and save the custom XML file without having to invoke any external IT processes. Word handles both XSLT transformations to and from WordprocessingML.
This approach treats XML editing as essentially a conversion problem. While the activity of conversion isn't the same as that of editing, they're related. If you can create a reasonably reliable transformation from a legacy document format to a desired XML format, then it stands to reason that you could use the same transformation for new Word documents that users create.
A few things can make this easier for the scenario in which authors are creating new documents, as opposed to you converting legacy documents. Before users start authoring documents, you have the freedom to decide what Word template to use, along with the appropriate styles--whereas you don't have that option when converting legacy documents that already exist.
Another advantage of this approach is that it doesn't force the Word user to adopt a new model or way of thinking or editing (which is decidedly not the case if you make them use Word's built-in custom XML features). The savvy Word author doesn't have to know that the document will be converted to XML later on. They just know that using styles is good practice. But even if they don't know that, we can force them (through formatting restrictions) to use the correct styles to get the formatting they want.
One of the things I like about this "lightweight" approach is that, beyond creating a Word template, the only code you have to write is two XSLT style sheets. It sounds deceptively simple. The problem is that the more complicated your XML formats become, the more difficult it will be to define round-trip mappings between them and WordprocessingML. In the real world, we usually want to support at least some forms of recursive markup. For example, we should be able to specify that some text is "strong" and "emphasized" by using markup like this:
<strong>This is bold <emphasis>and italic</emphasis>.</strong>
But since Word doesn't support such combinations of character styles, you have to merge these into a single style definition, called something like StrongAndEmphasis. And you'll also want to account for the scenario in which a <strong> element appears inside an <emphasis> element, not just the other way around. So we would need to add a rule to our onload style sheet that looks something like this:
<xsl:template match="strong/emphasis/text() | emphasis/strong/text()" priority="1">
  <w:r>
    <w:rPr>
      <w:rStyle w:val="StrongAndEmphasis"/>
    </w:rPr>
    <w:t>
      <xsl:value-of select="."/>
    </w:t>
  </w:r>
</xsl:template>
The transformation back to the custom XML format is even trickier if we want to avoid flattened markup that looks like this in the result:
<strong>This is bold </strong> <strong><emphasis>and italic</emphasis></strong> <strong>.</strong>
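To see where the flattening comes from, here is a minimal sketch of the most direct onsave rules, which map each WordprocessingML run to markup on its own. It is not the article's actual wordml2article.xsl. The style names Strong and StrongAndEmphasis are assumptions that mirror the onload rule above, and a real style sheet would also need rules for paragraphs and the document root.

```xml
<?xml version="1.0"?>
<!-- Sketch only: naive run-by-run mapping from WordprocessingML back to
     the custom format. Style names are assumed, not taken from the article. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <!-- A run in the (assumed) Strong character style -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Strong']">
    <strong>
      <xsl:value-of select="w:t"/>
    </strong>
  </xsl:template>

  <!-- A run in the merged StrongAndEmphasis character style -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val='StrongAndEmphasis']">
    <strong>
      <emphasis>
        <xsl:value-of select="w:t"/>
      </emphasis>
    </strong>
  </xsl:template>

</xsl:stylesheet>
```

Because every run produces its own <strong> wrapper, three consecutive runs yield three sibling <strong> elements rather than one element with nested content.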
That's not to say that your average XSLT wizard won't be able to figure out a solution--maybe even a generic solution. (I can imagine using a two-stage transformation that would allow you to reintroduce a normalized hierarchy into the markup, but that's getting out of scope here.) It's just that it won't be terribly straightforward. Even so, I like the challenge.
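As a rough illustration of what a second stage might look like, the following XSLT 1.0 sketch merges runs of directly adjacent <strong> siblings into a single element. It is one possible approach, not the author's solution, and it rests on several assumptions: the first stage emits the <strong> elements with no whitespace or other nodes between them, the elements carry no attributes that need preserving, and only the <strong> level needs merging.

```xml
<?xml version="1.0"?>
<!-- Sketch only: a second-pass style sheet that collapses adjacent
     <strong> siblings produced by a first, run-by-run pass. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- First <strong> in a run of adjacent <strong> siblings:
       emit one wrapper and walk along the run -->
  <xsl:template match="strong[not(preceding-sibling::node()[1][self::strong])]">
    <strong>
      <xsl:apply-templates select="." mode="merge"/>
    </strong>
  </xsl:template>

  <!-- Later <strong> elements in the run were already consumed above -->
  <xsl:template match="strong[preceding-sibling::node()[1][self::strong]]"/>

  <!-- Emit this element's content, then continue with the next sibling
       if it is also a <strong> -->
  <xsl:template match="strong" mode="merge">
    <xsl:apply-templates select="node()"/>
    <xsl:apply-templates
        select="following-sibling::node()[1][self::strong]" mode="merge"/>
  </xsl:template>

</xsl:stylesheet>
```

Applied to three adjacent <strong> elements containing "This is bold ", an <emphasis> element, and ".", this pass produces a single <strong> element with the <emphasis> nested inside it. A generic solution would have to apply the same idea at every nesting level and for every inline element type.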
The takeaway from this article should not be that Word's custom XML schema features are completely useless. No, they have their uses, particularly if you've got more data-oriented, business-template document formats. The thing to keep in mind is that this is essentially version 1.0 technology. It is exciting, even if it's not ready for prime time in terms of general XML editing. It will definitely be interesting to see what the next version of Word will add in terms of XML support. Until then, you might still be able to employ Word in a robust and usable way for your document-oriented XML applications with a little bit of creativity and XSLT trickery.
Evan Lenz is an XML developer specializing in XSLT.
O'Reilly Media, Inc., recently released (June 2004) Office 2003 XML (http://www.oreilly.com/catalog/officexml).
Chapter 2, The WordprocessingML Vocabulary, is available free online.