Lightweight XML Editing in Word 2003

by Evan Lenz, coauthor of Office 2003 XML

Did you know that Word documents can be saved in XML format? As of Microsoft Office 2003, the second option in Word's Save As dialog--right under "Word Document (*.doc)"--is "XML Document (*.xml)". This format is Microsoft's own XML vocabulary for Word documents, called WordprocessingML (or sometimes just WordML).

Figure 1

The ability to save Word documents as XML is arguably the most important XML-related feature introduced in Word 2003. But you wouldn't know it from all the hype surrounding Word's new support for custom schemas. When Microsoft announced that Word would let you edit XML documents that conform to your own schema (not just the WordprocessingML one), we were rightly intrigued and even excited. The promise of using the world's most popular word processor to edit, say, Docbook documents was nothing less than astounding, and it caused quite a stir in the XML community.

Hope Deferred

Now that the dust has settled and Office 2003 has been available for almost a year, we've got a clearer picture of reality. While the XML features in Word, Excel, Access, and the new InfoPath application are truly impressive and useful, it's clear that Word 2003 doesn't support arbitrary XML editing. At least it doesn't line up with the picture Microsoft painted originally. For one, the custom schema functionality is available only with Office Professional or the stand-alone Word 2003. More importantly, the features don't live up to the hype. While, strictly speaking, you can edit custom XML in Word, you're limited to using schemas that have a very static, fill-in-the-blanks structure. That means no optional or repeating elements and certainly no mixed content--that is, if you want a minimally user-friendly experience.

Or you could force your users to apply XML elements to portions of their document manually, using the new XML Structure task pane with Show XML Tags turned on. In that case, yes, they could edit arbitrary XML documents, even those with mixed content. And yes, Word will let them know if they've done something invalid (though it won't stop them from doing it). But since the user has to do all the work, and since XML elements cannot be associated with style information, the experience is not close to being user friendly (let alone WYSIWYG).

Or you could try to script in all the user friendliness by hand through the new Document Actions task pane. Of course, you should plan on joining a monastery to learn Smart Document programming and the attendant asceticism you'll need in order to appreciate the usability (or lack thereof) of your efforts' final results. (Tell me again, why are we using Word?)

Or (finally) you could come to terms with the fact that the most important (and robust) XML feature that was introduced in Word 2003 is its capability to save documents in a lossless, well-formed, open XML format called WordprocessingML. Ways to use it for generating, transforming, converting, querying, and otherwise processing Word documents are only starting to be realized. Editing custom XML may not be WordprocessingML's killer app, but it does raise some interesting possibilities that we'll explore here.

A Lightweight XSLT-Based Approach

This article presents a lightweight approach to XML editing in Word. It's "lightweight" in that it ignores all of Word's built-in custom schema functionality. A nice side effect of this approach is that it works in all editions of Word 2003. All you need outside of Word is an XSLT processor. (If you do happen to have the advanced XML functionality, you can make use of Word's bundled XSLT processor, but that's not required.)

This approach to editing will work only when your XML format is isomorphic to the structure and styles of your Word documents. The document's markup will only be as rich as the styles that are applied to it, so this rules out full-on Docbook editing. Word doesn't work well for editing recursive markup structures in general, because it doesn't support recursive styles. Each paragraph has exactly one paragraph style, and each character is associated with exactly one character style. (Word does, however, provide a convenient representation of heading levels as hierarchical subsections, using the <wx:sub-section> element, which we'll see referenced in our example below.)
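
To make the heading hierarchy concrete, here is roughly what Word 2003 saves for a Heading 2 paragraph, a body paragraph, and a nested Heading 3. Treat it as a hand-trimmed sketch rather than verbatim Word output: the text is invented, the namespace URIs are the Word 2003 ones as far as I know, and a real file carries many more properties, attributes, and wrapper elements.

```xml
<w:body xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <wx:sect>
    <!-- Word wraps each heading and everything beneath it in wx:sub-section -->
    <wx:sub-section>
      <w:p>
        <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
        <w:r><w:t>Second section</w:t></w:r>
      </w:p>
      <w:p>
        <w:r><w:t>This section will have some sub-sections.</w:t></w:r>
      </w:p>
      <!-- A deeper heading opens a nested wx:sub-section -->
      <wx:sub-section>
        <w:p>
          <w:pPr><w:pStyle w:val="Heading3"/></w:pPr>
          <w:r><w:t>First sub-section</w:t></w:r>
        </w:p>
        <!-- further paragraphs of the sub-section would follow here -->
      </wx:sub-section>
    </wx:sub-section>
  </wx:sect>
</w:body>
```

The nesting is what makes the reverse transformation tractable: the flat list of styled paragraphs already arrives grouped by heading level.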

You can make a complete XML editing solution for Word by writing two XSLT style sheets:

  1. A style sheet to transform from your custom XML to WordprocessingML, and
  2. A style sheet to transform from WordprocessingML back to your custom XML.

The basic scenario goes like this: before a custom XML document can be edited, style sheet No. 1 transforms it into WordprocessingML so that a user can work on it in Word. When the user has finished editing, style sheet No. 2 transforms the resulting WordprocessingML back to the custom XML format.

Note: This article does not introduce WordprocessingML except by example. For more thorough coverage, refer to the Office 2003 XML sample chapter available online, called "The WordprocessingML Vocabulary" (PDF).

An Example

Before we look at the XSLT, here's a document that conforms to a dead-simple, Docbook-esque format that we'll be editing:

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>
<article>
  <title>This is the article title</title>
  <section>
    <title>First section</title>
    <para>This is the <emphasis>first</emphasis> paragraph.</para>
    <para>This is the <strong>second</strong> paragraph.</para>
  </section>
  <section>
    <title>Second section</title>
    <para>This section will have some sub-sections.</para>
    <section>
      <title>First sub-section</title>
      <para>This is the paragraph text of the first sub-section.</para>
    </section>
    <section>
      <title>Second sub-section</title>
      <para>This is the paragraph text of the second sub-section.</para>
      <para>And here is another paragraph, just for the fun of it--with a
<a href="">hyperlink</a> to boot!</para>
    </section>
  </section>
</article>

Here is what we want this document to look like while it's being edited in Word:

Figure 2

As you can see, the XML has a few examples of mixed content, which are rendered in Word using character styles (italic, bold, and blue/underlined). The hierarchical sections of the XML document are rendered using a heading for each title (Heading 1 for the article title, and Heading 2, Heading 3, and so on for successively deep section titles).

The Code

Assuming we have this XML document lying around already and we want to let people edit it, we'll need an XSLT style sheet to transform it to WordprocessingML (style sheet No. 1 in the list above). This file is called article2wordml.xsl. It contains various template rules that map elements from the custom XML to elements and styles defined in WordprocessingML. For example, to turn <emphasis> elements into character runs with the Emphasis character style, we use the following template rule:

<!-- For text in <emphasis>, apply the "Emphasis" character style -->
<xsl:template match="emphasis/text()">
  <w:r>
    <w:rPr><w:rStyle w:val="Emphasis"/></w:rPr>
    <w:t><xsl:value-of select="."/></w:t>
  </w:r>
</xsl:template>
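
The same pattern should extend to the other inline elements in the sample. As a sketch, here is a companion rule for <strong>. It is not taken from the article's style sheet: it assumes that a character style with the ID "Strong" is defined in the generated document's style list, and that the rule sits inside an xsl:stylesheet element that binds the w: prefix.

```xml
<!-- Hypothetical companion rule: text in <strong> gets the "Strong" character style -->
<xsl:template match="strong/text()">
  <!-- Each text node becomes one run (w:r) carrying the style in its run properties -->
  <w:r>
    <w:rPr>
      <w:rStyle w:val="Strong"/>
    </w:rPr>
    <w:t>
      <xsl:value-of select="."/>
    </w:t>
  </w:r>
</xsl:template>
```

Matching on the text node rather than on the element keeps the rule to one run per piece of text, which fits Word's model of exactly one character style per run.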

To turn section titles into hierarchical headings, we use this template rule:

<!-- Convert section titles to "Heading X" paragraphs -->
<xsl:template match="section/title">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading{count(ancestor::section)+1}"/></w:pPr>
    <xsl:apply-templates/>
  </w:p>
</xsl:template>
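
As a worked example of the attribute value template: the title "First sub-section" in the sample has two <section> ancestors (its own section and the enclosing "Second section"), so count(ancestor::section)+1 evaluates to 3 and the style value becomes Heading3. A top-level section title has one <section> ancestor and gets Heading2, which leaves Heading 1 for the article title. Assuming the rule wraps the style in paragraph properties and emits the title text as a plain run (an assumption about the surrounding markup, with the namespace URI being the Word 2003 one as far as I know), the output for that title would look something like this:

```xml
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <!-- Two enclosing <section> elements plus one gives heading level 3 -->
  <w:pPr>
    <w:pStyle w:val="Heading3"/>
  </w:pPr>
  <w:r>
    <w:t>First sub-section</w:t>
  </w:r>
</w:p>
```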

Once a user has made changes to the document from within Word, a new WordprocessingML document is saved and must be translated back to the custom XML format using style sheet No. 2 mentioned above. This style sheet, called wordml2article.xsl, has similar rules, except that they reflect the reverse mapping--from WordprocessingML to our custom XML format. For example, here's the rule that turns text in the Emphasis style into an <emphasis> element:

<!-- Turn a run with the "Emphasis" character style into <emphasis> -->
<xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Emphasis']" mode="para-content">
  <emphasis>
    <xsl:copy-of select="w:t/text()"/>
  </emphasis>
</xsl:template>
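
The reverse rule for bold text would presumably follow the same shape. The sketch below is hypothetical rather than quoted from wordml2article.xsl: it assumes the forward transformation used a character style named "Strong", and that paragraph content is processed in the para-content mode used by the paragraph rules in this article.

```xml
<!-- Hypothetical companion rule: a run with the "Strong" character style becomes <strong> -->
<xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Strong']" mode="para-content">
  <strong>
    <!-- Copy only the run's text; run properties are dropped -->
    <xsl:copy-of select="w:t/text()"/>
  </strong>
</xsl:template>
```

Runs without a recognized character style fall through to XSLT's built-in rules, which copy their text and discard the markup.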

Here are the rules that convert the Heading paragraphs back to sections with titles:

<!-- Convert <wx:sub-section> elements to <section> elements -->
<xsl:template match="wx:sub-section">
  <section>
    <xsl:apply-templates/>
  </section>
</xsl:template>

<!-- Convert <w:p> paragraphs to <para> paragraphs -->
<xsl:template match="w:p">
  <para>
    <xsl:apply-templates mode="para-content"/>
  </para>
</xsl:template>

<!-- ...except for the first paragraph in a sub-section (Heading 1,2,3,...);
     the heading will be the <title> of the section -->
<xsl:template match="wx:sub-section/w:p[1]">
  <title>
    <xsl:apply-templates mode="para-content"/>
  </title>
</xsl:template>
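
The w: and wx: prefixes in these match patterns resolve only if the style sheet binds them to the namespaces Word uses. A minimal wrapper might look like the following. The namespace URIs are the Word 2003 ones to the best of my knowledge, and exclude-result-prefixes is an addition of mine, there to keep the Word namespace declarations out of the custom XML output.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    exclude-result-prefixes="w wx">

  <!-- The template rules shown above go here -->

</xsl:stylesheet>
```

Note that the rule matching wx:sub-section/w:p[1] wins over the plain w:p rule without an explicit priority attribute, because XSLT 1.0 assigns a multi-step pattern a higher default priority (0.5) than a bare element name (0).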

For a complete investigation of the style sheets (including descriptive comments), see the full text of these files:
