Article from November, 1998.


Automating SGML-to-XML DTD Conversion (Part 2)

By Bob DuCharme

Bob DuCharme, a software engineer at Moody's Investors Service, is also the author of The Operating Systems Handbook, a crash course in mini and mainframe operating systems, published by McGraw-Hill Professional Books. © 1998 Bob DuCharme.


Abstract

When migrating from SGML to XML, you need to know some syntactical and philosophical differences between the two standards. Bob presents a Perl program that can convert an SGML DTD most of the way to an XML DTD.


Why Automate?

Last month we saw that if your system takes advantage of validating XML applications without putting SGML behind it (for example, if XML is one of several outputs from your SGML system), you have to maintain parallel sets of DTDs or develop a way to automatically convert an SGML DTD to an XML one each time the SGML version is edited. Automatic conversion makes more sense for the same reason that converting documents from a central set or generating database views from a central database makes more sense than trying to maintain parallel sets of similar information: less data redundancy means more efficiency and fewer errors.

DTD conversion issues fall into three categories: changes that you can make to the original SGML DTD without losing information (for example, ensuring case consistency for element type and attribute names), changes that can be easily automated, and the tougher issues that each require a judgment call. In part one of this article, after discussing various examples of the first category of conversion issues, we left off with a quick look at a Perl script that can handle the second category; in this concluding part, we'll look more closely at the Perl script and discuss strategies for dealing with the third, more difficult category of conversion issues.

#! /usr/local/bin/perl

# Read one declaration at a time: "<" is the new input record delimiter,
# in case input declarations have carriage returns in them.
$/ = "<";

while (<>) {

  # Lose minimization parameters.
  s/^\!ELEMENT\s+(\S+)\s+[\-o]\s+[\-o]\s+/!ELEMENT $1 /i;

  # Add "?" before PI's closing ">". Anchored at the start of the record
  # so that "?" occurrence indicators in content models are left alone.
  s/^\?(.*?)>/?$1?>/s;

  # Convert attribute types to NMTOKEN.
  s/NUTOKEN/NMTOKEN/ig;
  s/NUMBER/NMTOKEN/ig;
  s/NAME/NMTOKEN/ig;

  # Lose exclusion specifications.
  s/\s*\-\([^\)]*\)\s*//;

  # Point to different directories for ISO entities.
  s/c:\\dev\\sgml\\dtds\\/c:\\dev\\sgml\\dtds\\xml\\/;

  print;
}

(Keep in mind that the following would be easy enough to implement in awk, OmniMark, or any other text-processing language with pattern-matching capabilities.)

Deleting minimization parameters and converting processing instructions is simple, although a minority of SGML processing instructions will not conform to XML merely by having a question mark added before their closing ">" character. James Clark's comparison of SGML and XML, at http://www.w3.org/TR/NOTE-sgml-xml.html, describes further changes that your processing instructions may need; consult his paper if an XML processor doesn't like your converted processing instructions.
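
One way to spot that troublesome minority before an XML processor does is to test each converted processing instruction against two requirements of the XML 1.0 specification: the instruction must open with a target name, and that target may not be "xml" in any mix of upper and lower case. The following is only a minimal sketch and not a summary of Clark's note; pi_problems is a hypothetical helper name, and its name test is an ASCII-only approximation of XML's rules for names.

```perl
use strict;
use warnings;

# Hypothetical helper: list reasons an XML parser may still reject a
# converted processing instruction. $pi is the full text, "<?...?>".
sub pi_problems {
    my ($pi) = @_;
    my ($body) = $pi =~ /^<\?(.*)\?>\z/s
        or return ('not delimited by "<?" and "?>"');

    my @problems;
    # XML wants a target name first (ASCII-only approximation of a name).
    if ($body =~ /^([A-Za-z_:][\w.:\-]*)(?:\s|\z)/) {
        push @problems, 'target "xml" is reserved' if lc($1) eq 'xml';
    }
    else {
        push @problems, 'does not start with a target name';
    }
    # "?>" may appear only as the closing delimiter.
    push @problems, 'contains "?>" before the end'
        if index($body, '?>') >= 0;
    return @problems;
}

for my $pi ('<?page-break?>', '<? newpage?>', '<?xml old?>') {
    my @problems = pi_problems($pi);
    print "${pi}: ", (@problems ? join('; ', @problems) : 'looks fine'), "\n";
}
```

Instructions that this check flags are the ones to edit by hand or to handle with an extra substitution line of your own.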

The Perl script doesn't have to explicitly provide for the conversion of the plural attribute types NUTOKENS, NUMBERS, or NAMES, because the lines that convert the singular versions will change the appropriate characters of the plural versions as well. If any of your element or entity names contain the string "NUTOKEN," "NUMBER," or "NAME" (the latter two being far more likely), then those lines of the Perl script, as written, will alter those names too; the lines will need extra code to ensure that they only affect the attribute type parameters of attribute list declarations.
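
The extra code could take roughly the following shape. This is a sketch under stated assumptions, not part of the original script: convert_attr_types is a hypothetical name, the record is assumed to arrive as the main loop sees it (split on "<", so it starts with "!ATTLIST"), and every attribute's default is assumed to be a "#" keyword or a quoted literal. An unquoted default value would be missed.

```perl
use strict;
use warnings;

# Hypothetical helper: convert SGML-only attribute types to NMTOKEN(S),
# touching nothing but attribute list declarations.
sub convert_attr_types {
    my ($record) = @_;
    return $record unless $record =~ /^!ATTLIST\b/i;

    # \b protects names such as "filename". The lookahead demands that a
    # default value follows, so an attribute that is itself called "name"
    # or "number" is left alone. The optional S keeps plurals plural.
    $record =~ s/\b(?:NUTOKEN|NUMBER|NAME)(S?)(?=\s+(?:#|["']))/NMTOKEN$1/gi;
    return $record;
}

print convert_attr_types(
    "!ATTLIST fig filename NAME #REQUIRED width NUMBER #IMPLIED>\n");
print convert_attr_types("!ELEMENT name (#PCDATA)>\n");
```

In the main loop, a call such as $_ = convert_attr_types($_); would stand in for the three attribute-type substitution lines.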

A simple script like this could never automate the conversion of an element declaration's inclusion specification into its content model; that issue is explored further below. Luckily, the simple deletion of an exclusion specification from your XML DTD is a perfectly backward-compatible change to make and won't hurt your publishing system. If your SGML DTD has the following declarations,

				<!ELEMENT para - - (#PCDATA) +(emph)>
				<!ELEMENT emph - - (#PCDATA) -(emph)>
			
authors using SGML editing software won't be able to put emph elements inside of their emph elements, and the exclusion specification will have served its purpose. Once the Perl script converts the second declaration into the following,
				<!ELEMENT emph (#PCDATA)>
			
it will be completely backward-compatible with the data. The XML version doesn't need the exclusion specification anyway--it's in the SGML version of the declaration to constrain authors, and the XML DTD isn't for your authors, but for one of your output processors.

XML doesn't support SDATA entities. Instead of dealing with the general case, this Perl script deals with the specific case that accounts for their most common use: declaring special characters. The script merely changes the path names so that DTD system identifiers look in a new directory for the entity sets; you would keep XML-compliant special character declarations (for example, Rick Jelliffe's set at http://www.sil.org/sgml/xml-ISOents.txt) in this alternative directory. The Perl script would change the following SGML declaration (note that the system identifier has been added to the SGML DTD as a category 1 conversion edit)

				<!ENTITY % ISOlat1 PUBLIC
    "ISO 8879-1986//ENTITIES Added Latin 1//EN"
    "c:\dev\sgml\dtds\isonum.ent">
			
to this, revising the path name:
				<!ENTITY % ISOlat1 PUBLIC
    "ISO 8879-1986//ENTITIES Added Latin 1//EN"
    "c:\dev\sgml\dtds\xml\isonum.ent">
			

Changes for the Worse

Now, the hard part. What about mixed content models? What about inclusions, RCDATA, CONREF, and SUBDOC? The answer is not a pleasant one. Like the category 1 changes, these require manual changes to the DTD, but at a cost: you give up some of the expressive power that keeps people using SGML instead of XML.

If this means changing the following content model, which (even after conversion by the Perl script) is unacceptable to an XML processor,

				<!ELEMENT chapter - - (para|#PCDATA)+ +(illus)>
			
to this,
				<!ELEMENT chapter - - (#PCDATA|para|illus)*>
			
it's no great loss; the original didn't have much structure anyway. On the other hand, an element declaration of
				<!ELEMENT chapter - - (title,(para|#PCDATA)+) +(illus)>
			
has more structure to it, and converting it to
				<!ELEMENT chapter - - (#PCDATA|title|para|illus)*>
			
gives up the original's constraint that the title element must come first in a chapter.
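
If you decide that this loss is acceptable across a whole DTD, the flattening itself is mechanical. The following is a minimal sketch, assuming declarations that contain no parameter entity references and that spell #PCDATA in upper case; flatten_mixed is a hypothetical name. It makes the judgment call for you on every declaration it touches, so it suits DTDs where you have already checked that sequence constraints like the one above can go.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten a mixed content model, plus any inclusion,
# into the one mixed form XML accepts: (#PCDATA|a|b)*. Order and
# occurrence constraints are deliberately thrown away.
sub flatten_mixed {
    my ($decl) = @_;
    my ($name, $min, $model) =
        $decl =~ /^<!ELEMENT\s+(\S+)\s+((?:[-oO]\s+[-oO]\s+)?)(.+?)\s*>\s*\z/is
        or return $decl;
    return $decl unless $model =~ /#PCDATA/;

    # Drop exclusions first so their names are not added to the model.
    $model =~ s/\s+-\([^)]*\)//g;

    # Collect element names from model and inclusion, in order, once each.
    my %seen;
    my @names = grep { $_ ne '#PCDATA' && !$seen{$_}++ }
                $model =~ /(#?[A-Za-z][\w.\-]*)/g;

    my $flat = @names ? '(' . join('|', '#PCDATA', @names) . ')*'
                      : '(#PCDATA)';
    return "<!ELEMENT $name $min$flat>";
}

print flatten_mixed(
    '<!ELEMENT chapter - - (title,(para|#PCDATA)+) +(illus)>'), "\n";
```

Run on the structured chapter declaration, the sketch produces the flattened declaration shown above; declarations without #PCDATA pass through unchanged, and the minimization parameters are kept so that the result is still a valid edit to the SGML DTD.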

Perhaps your DTD would benefit from an even more rigorous structure. You could declare new elements to act as containers for the PCDATA in your mixed content models so that they become element content, but this can have an even greater cost: it's not backward-compatible with your data, so you'll have to change your documents and all the other processes that manipulate them to account for this change.

So, some changes have costs, and you'll have to compare these costs with the benefits of making these changes to see if they're worth it. It helps to remember that you can customize the script for your particular system--for example, when I described the technique for avoiding SDATA entity declarations by changing system path names to point at new ISO entity sets, I didn't point out that you won't use that script line verbatim, but will instead substitute your own path names. This seemed obvious, but it's only one example of this kind of customization: the script is just a starting point, and customizing it to work for your system should be a much easier task than revising it to be all things to all DTDs.

My Perl script doesn't change all occurrences of RCDATA to PCDATA or to CDATA because either change would have made too vast a generalization about everyone's data. You, however, are in a position to make generalizations about your own data, and you can use these to your advantage when developing a DTD conversion script. If you don't use CONREF, don't worry about CONREF. If you do use it, see if your DTDs use it in a consistent enough manner that a line or two of a script will change it to something acceptable to an XML parser.
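
The path-name customization, for instance, becomes easier to maintain if the directories are pulled out of the substitution and into variables. A minimal sketch, assuming Windows-style paths like those in the script; repoint_entity_sets, $sgml_dir, and $xml_dir are hypothetical names, and the directories are examples only:

```perl
use strict;
use warnings;

# Hypothetical helper: repoint system identifiers from one directory to
# another. Both directories are parameters, so nothing is hard-coded.
sub repoint_entity_sets {
    my ($record, $from, $to) = @_;
    # \Q...\E makes the backslashes and dots in $from match literally.
    $record =~ s/\Q$from\E/$to/g;
    return $record;
}

# Example directories only; substitute the ones your own system uses.
my $sgml_dir = 'c:\\dev\\sgml\\dtds\\';
my $xml_dir  = 'c:\\dev\\sgml\\dtds\\xml\\';

print repoint_entity_sets('"c:\\dev\\sgml\\dtds\\isonum.ent">',
                          $sgml_dir, $xml_dir), "\n";
```

Because \Q...\E does the escaping, the directory names can be written once, as ordinary strings, without hand-quoting them for use in a regular expression.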

In my first draft of this two-part article, the script and the list of category 2 changes had only the first three issues. As I thought about the tougher, category 3 issues, I eventually realized that content model exclusions and SDATA entities weren't such intractable problems after all, and added the lines we saw above to the script. Moral: don't get too discouraged if your DTDs at first seem to have features that can't be converted by an automated process. Your first glance through them won't reveal all the repetitive patterns that a script can use as a handle to turn your declarations into something that an XML processor will happily accept.

All original material on this site is copyright © 1994-2010 by Architag International Corporation, All rights reserved. No part of this information may be reproduced in any form without express permission from
Architag International Corporation.