Article from October, 1998.


Automating SGML-to-XML DTD Conversion (Part 1)

By Bob DuCharme, XML Correspondent

Bob DuCharme, a software engineer at Moody's Investors Service, is also the author of The Operating Systems Handbook, a crash course in mini and mainframe operating systems, published by McGraw-Hill Professional Books.


Abstract

SGML DTDs can be converted to XML DTDs by following just a couple of rules. The differences between SGML DTDs and XML DTDs fall into two categories: those that can be changed automatically (by a computer program), and those that can't. In this first of a two-part article, the author shows a Perl program that handles the changes that can be made automatically.


All discussions of SGML-to-XML DTD conversion (for example, Norm Walsh's excellent "Converting an SGML DTD to XML" at http://www.xml.com/xml/pub/98/07/dtd/index.html) agree that while much of the process might be automated, for most DTDs, some judgment calls are necessary. Mixed content models, public identifiers with no accompanying system identifiers, inclusions, RCDATA, and other issues require a knowledgeable human to figure out alternatives and select one.

Maybe It's Not a One-Time Conversion

The manual nature of this process can be a problem if the conversion must be done repeatedly. Why would you need to translate the same DTD more than once? Look at it this way: why translate the same document more than once? Because after you make some edits, you need both versions to reflect the edits.

Maybe, when you convert your SGML to XML, you're not putting SGML behind you. A complex SGML publishing system may be responsible for output to print, CD-ROM, XML, HTML, Palm Pilot DOC files, and RTF. XML is only one item in a list here, but if the XML application in question uses a validating XML processor, it needs a DTD with its documents.

Imagine that after doing the analysis and manual edits necessary to create an XML DTD from the SGML DTD, you're publishing away for three months, and then you make a change to your SGML DTD. What about the XML DTD? You certainly don't want to do all the analysis and manual editing necessary to redo the conversion, so perhaps you just make the corresponding edit to your XML DTD by hand.

If you're serious enough about SGML that you're maintaining DTDs, or even if you're a relational database person who knows why we normalize database table schemas before implementing them, a little voice in your head should say, "But that would be wrong!" Maintaining parallel, similar sets of information by making parallel manual (or even automatic) edits to them is asking for trouble. Whether you automatically generate your mailing list from your relational database customer list, your RTF files from your SGML files, or your XML DTD from your SGML DTD, automatic generation of alternative versions of your information from a single, centrally maintained source means greater efficiency and a lower probability of trouble. These gains compound as the amount of necessary editing increases over time.

Three Categories of Conversion Issues

So how do we approach the tougher problems of DTD conversion? First, let's divide all conversion issues into three ranked categories:

  1. Changes to the original SGML DTD that don't deprive it of any information.

  2. Changes that can be made using an automated utility.

  3. Changes that require a judgment call.

I won't try to assign every single conversion issue to one of these categories; instead, I'll list some representative ones for each and explain strategies for dealing with category 3. James Clark's comparison of SGML and XML at http://www.w3.org/TR/NOTE-sgml-xml.html gives the most complete listing of issues, and between Norm's article and the approaches shown here, you'll be able to determine the best course of action to take with your own DTDs.

Few issues fall squarely into category 1 without some overlap into category 2. Enough facility with a scripting language would let you automate most of them. However, their qualification for category 1 means you may as well go ahead and make the changes to your SGML DTD. It won't lose any power in its role as an SGML DTD, and getting these changes out of the way makes your category 2 conversion script simpler and more efficient by minimizing its responsibilities.

Category 1 edits to make to your SGML DTD include the following:

  • Conversion of keywords like "system" and "DocType" to all upper-case.

  • Quoting of default attribute values.

  • Separation of enumerated attribute choices by the pipe symbol (|) instead of a comma.

  • Separating elements declared as part of a name group into their own individual element and attribute list declarations.

  • Adding system identifiers after each public identifier. You might add code to a conversion script that looks up each public identifier's corresponding system identifier in an entity catalog, but unless your catalog's mappings change often, this would be easier to do by hand permanently in your SGML DTD. (The use of parameter entities in your system identifiers lets you retain much of the flexibility that led you to use parameter entities in the first place.)

  • Conversion of element and entity names to a consistent case.

  • Moving comments from element type and attribute list declarations to their own comment declarations.

The last two jobs in particular would benefit from being done manually. The only way to automate consistent case usage among element and entity names would be to convert them all to upper-case or all to lower-case. Conversion by hand would result in more readable DTDs, because a name like CustLastName is easier to read than CUSTLASTNAME. Comments moved out of element type and attribute list declarations to their own comment declarations by an automated procedure won't be as human-readable as comments moved by a human will, so they won't be as effective.
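For the more mechanical category 1 edits, a short script may still save some typing. The following is only a sketch of two of them (upper-casing keywords and switching enumerated attribute choices to the pipe symbol). The subroutine names upcase_keywords, pipe_group, and pipe_enumerations are hypothetical, the keyword list is deliberately incomplete, and the sketch assumes that no attribute list declaration contains a literal ">" inside a quoted value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Upper-case declaration keywords. The keyword list is an assumption
# of this sketch; it does not cover #implied, cdata, and so on.
sub upcase_keywords {
    my ($dtd) = @_;
    $dtd =~ s/<!(doctype|element|attlist|entity|notation)\b/'<!' . uc($1)/gie;
    # Only touch system/public when a quoted literal follows, so that
    # an element that happens to be named "system" is left alone.
    $dtd =~ s/\b(system|public)\b(?=\s+["'])/uc($1)/gie;
    return $dtd;
}

# Rewrite the inside of one parenthesized group: commas become pipes.
sub pipe_group {
    my ($group) = @_;
    $group =~ s/\s*,\s*/|/g;
    return "($group)";
}

# Apply pipe_group only inside attribute list declarations, so that
# sequence connectors in element content models are not changed.
sub pipe_enumerations {
    my ($dtd) = @_;
    $dtd =~ s{(<!ATTLIST\b[^>]*>)}{
        (my $decl = $1) =~ s/\(([^()]*)\)/pipe_group($1)/ge;
        $decl;
    }gie;
    return $dtd;
}

# Made-up sample declarations, for illustration only.
my $dtd = qq{<!doctype book system "book.dtd">\n}
        . qq{<!attlist list type (bullet, number, none) "bullet">\n};

print pipe_enumerations(upcase_keywords($dtd));
```

Run on the two sample declarations, this should print them with DOCTYPE, SYSTEM, and ATTLIST in upper-case and with the choices written as (bullet|number|none). Case normalization of element and entity names and the relocation of comments are left out on purpose, for the reasons given above.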

Changes That Can Be Automated

This leaves five undisputed category 2 changes to make to your DTD:

  • Removing minimization parameters.

  • Adding a "?" before each processing instruction's closing ">" character.

  • Converting unsupported attribute types to the NMTOKEN (or, for plural cases, NMTOKENS) type.

  • Deleting exclusion specifications.

  • Pointing to new sets of ISO entities to avoid SDATA declarations.

The Perl script below performs these five conversions. Because we moved all borderline category 1/category 2 cases to category 1, the script is short and simple and would be easy enough to implement in awk, Omnimark, or any text-processing language with pattern-matching capabilities.

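#! /usr/local/bin/perl
$/ = "<";   # Read one "<"-delimited record at a time, so that
            # declarations with carriage returns in them stay together.
while (<>) {
   # Lose minimization parameters.
   s/^\!ELEMENT\s+(\S+)\s+[\-o]\s+[\-o]\s+/!ELEMENT $1 /i;

   # Add "?" before a PI's closing ">". Anchored to the start of the
   # record so that "?" occurrence indicators are left alone.
   s/^\?(.*?)>/?$1?>/s;

   # Convert attribute types to NMTOKEN (or NMTOKENS), but only in
   # attribute list declarations and only where a default value
   # (a "#" keyword or a quoted literal) follows the type, so that
   # element and attribute names containing "name" are not mangled.
   s/\b(?:NUTOKEN|NUMBER|NAME)(S?)(?=\s+(?:#|["']))/NMTOKEN$1/ig
      if /^!ATTLIST\b/i;

   # Lose exclusion specifications.
   s/\s*\-\([^\)]*\)\s*//;

   # Point to different directories for ISO entities.
   s/c:\\dev\\sgml\\dtds\\/c:\\dev\\sgml\\dtds\\xml\\/;

   print;
}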
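To try these kinds of substitutions without preparing an input file, they can be wrapped in a subroutine and run against a string. The following is a minimal sketch, not the author's script: convert_record and convert_dtd are hypothetical names, the sample declarations are made up, and the ISO entity path substitution is omitted because it depends on local directories. The sketch anchors the processing-instruction pattern at the start of each record so that a "?" occurrence indicator in a content model is left alone, and it assumes that default attribute values have already been quoted as a category 1 edit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one record: the text between two "<" characters, with the
# leading "<" already stripped (as happens when $/ is set to "<").
sub convert_record {
    my ($rec) = @_;

    # Drop the two minimization parameters ("- O", "- -", and so on).
    $rec =~ s/^!ELEMENT\s+(\S+)\s+[-o]\s+[-o]\s+/!ELEMENT $1 /i;

    # Close a processing instruction with "?>" (PI records only).
    $rec =~ s/^\?(.*?)>/?$1?>/s;

    # Map unsupported attribute types to NMTOKEN(S), only inside an
    # attribute list and only where a default value follows the type.
    $rec =~ s/\b(?:NUTOKEN|NUMBER|NAME)(S?)(?=\s+(?:#|["']))/NMTOKEN$1/ig
        if $rec =~ /^!ATTLIST\b/i;

    # Remove an exclusion such as -(footnote).
    $rec =~ s/\s*-\([^)]*\)\s*//;

    return $rec;
}

# Split on "<", convert each record, and put the "<" characters back.
sub convert_dtd {
    my ($dtd) = @_;
    return join '<', map { convert_record($_) } split /</, $dtd, -1;
}

# Made-up SGML declarations, for illustration only.
my $sgml = qq{<?stylesheet book.dsl>\n}
         . qq{<!ELEMENT para - O (#PCDATA|emph)* -(footnote)>\n}
         . qq{<!ATTLIST para name NAME #IMPLIED cols NUMBER "1">\n};

print convert_dtd($sgml);
```

For this sample, the processing instruction should come out closed with "?>", the element declaration should lose both its "- O" parameters and its exclusion, and the two attribute types should become NMTOKEN while the attribute that happens to be called "name" is untouched.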

In the next issue of <TAG>, we'll step through this script and discuss strategies for dealing with the judgment call conversion issues that can't be fully automated.

© 1998 Bob DuCharme.
