Article from July, 1998.

SGML: Changing to Accommodate XML

By Bob DuCharme

Bob DuCharme is the author of "SGML CD," a tutorial and user's guide to free SGML software, available from Prentice Hall. He also contributed to the "SGML Buyer's Guide" in the same series and to QUE Publishing's "Using SGML."

Abstract

"XML is a subset of SGML... well, with a little tweaking of SGML." In this first of a two-part article, regular contributor Bob DuCharme discusses the changes SGML is undergoing to keep pace with the needs of the electronic publishing marketplace, and the requirements for keeping in sync with XML.

XML is a subset of SGML... well, with a little tweaking of SGML. The XML specification makes many recommendations for particular ways to code XML documents "for interoperability," a term the XML specification defines in section 1.2 as applying to "a nonbinding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors that predate the Web SGML Adaptations Annex to ISO 8879."

What are these adaptations? Why should they matter to someone creating XML systems? Because XML was not really a proper subset of SGML until SGML got tweaked a little, so the XML spec has interoperability recommendations to highlight possible problems with processing XML documents using SGML software written before the revisions took place.

These revisions, known as the "Web SGML Adaptations Annex" to the ISO 8879 SGML standard, consist of two new annexes to the standard. The original version of the standard had nine annexes: A, "Introduction to Generalized Markup," through I, "Nonconforming Variations." The Web SGML adaptations propose two new ones: K, "Web SGML Adaptations," and L, "Application Requirements for XML." (New Annex J, "TC for Extended Naming Rules," is a short one that tweaks certain character-set-related definitions of ISO 8879 to ease the use of non-Latin alphabets in SGML.) You can see the latest version of the new annexes at http://www.ornl.gov/sgml/wg8/document/1960.htm.

The Web SGML annexes define a class of SGML documents that:
This makes it easier to write smaller, faster parsers, because these parsers don't have to account for as many different markup possibilities. Smaller, faster, easier-to-write parsers will make it easier to write browsers, editors, and other applications that use these documents, especially when these documents are stored remotely (for example, on a web server). So, it's called "Web SGML" because it defines a class of documents that will be much easier to use over the web than the full range of possible ISO 8879:1986 SGML documents. Just as XML has proven useful for far more than Web documents, the Web SGML adaptations are not limited to Web use either; as Annex K tells us, "although motivated by the World Wide Web, applicability of this annex extends to all uses of SGML."

Although these goals sound very similar to XML's, Web SGML Annex K barely mentions XML. Annex L's job is to define what it all means in terms of XML. Let's look more closely at Annex K.

"Annex K (normative): Web SGML Adaptations"

Unlike Annex L, which is "informative," Annex K is "normative": it specifies rules to follow if you want to conform to this spec. Its opening section on definitions introduces several terms that are new to SGML and XML, although most have counterparts in XML:

A type-valid SGML document is one with an associated document type declaration and DTD that the document conforms to. XML calls this a "valid" document; before the Web SGML adaptations, all legal SGML documents fell into this category.

A fully-declared document instance has all necessary markup declarations explicitly declared. If no declarations at all are explicitly declared, Annex K lets the processor assume an implied prolog of the following line:

    <!DOCTYPE #IMPLIED SYSTEM []>

This means that you can have an SGML document with no prolog, like the following:

    <thnad>This is a tag-valid WebSGML document</thnad>

XML calls this a "well-formed" document.
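As a rough illustration of the XML side of this correspondence, the following Python sketch runs the prolog-free example above through a non-validating XML parser from the Python standard library. It is not taken from Annex K or from any SGML tool; the helper name `is_well_formed` and the second sample string are made up here for illustration, and the check covers only XML well-formedness, not SGML conformance.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# No prolog and no DTD: a non-validating XML parser still accepts this.
no_prolog = "<thnad>This is a tag-valid WebSGML document</thnad>"

# End-tag left out: a DTD in full SGML could permit this through tag
# omission, but it can never be well-formed XML.
missing_end_tag = "<thnad>This is a tag-valid WebSGML document"

print(is_well_formed(no_prolog))
print(is_well_formed(missing_end_tag))
```

The parser needs no declarations at all to accept the first string, which is the practical payoff of letting a processor assume an implied prolog.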
A fully-tagged document instance has a start- and end-tag for every element, and every element has all of its attributes specified with attribute names. Some people have used the term "normalized" to refer to such a document before: for example, James Clark's spam and SGMLnorm programs included with his SP class library, and OmniMark's normal*.xom scripts included with their product.

In an integrally-stored document instance, "every element and marked section ends in the entity in which it begins." This is a key aspect of well-formed XML documents and elements; in SGML documents that are not Web SGML, you can begin an element or marked section in one entity (for example, one file) and finish it in another. A parser that has finished reading such a file must keep track of incomplete elements and marked sections and look for their proper completion in another entity, so relieving the parser of this responsibility makes its job much easier.

A reference-free document has only entity references that refer to predefined data character entities.

An external-reference-free document is one with no references to external entities. While XML has no such condition, it does relieve processors of the responsibility of reading external entities when they find references to them in documents being checked for well-formedness.

Annex K specifies many other changes that make Web SGML documents more and more like XML documents:
One key difference between XML and Web SGML is the use of an SGML declaration. All the parameters that an SGML declaration can affect when processing SGML documents are hardcoded in XML, so there's no point in specifying values for these parameters when processing XML documents. The only "XML declaration" is the single processing instruction that identifies the version of XML being used and some optional encoding and external entity dependence information, for example <?xml version="1.0"?>. (James Clark has made the SGML equivalent of XML's hardcoded SGML declaration available in his "Comparison of SGML and XML" at http://www.w3.org/TR/NOTE-sgml-xml.html.)

Web SGML does require an SGML declaration, and one with a new minimum literal: instead of the "ISO 8879:1986" that has always followed the SGML declaration's opening <!SGML, a Web SGML document's SGML declaration must have "ISO 8879:1986 (WWW)" after the opening <!SGML.

Web SGML does offer a nice new way to specify an SGML declaration: by using a system or public identifier, just as external entities do. Many SGML users have found it annoying that, when resetting one value in an SGML declaration to differ from the default reference concrete syntax setting (for example, setting NAMELEN to be greater than 8), they had to include the entire SGML declaration in all the documents that used a particular DTD. Now, you only need a single line like the following at the start of a document to identify a separately stored alternative SGML declaration to use:

    <!SGML HTML3.2 PUBLIC "+//IDN W3C.ORG//SD HTML Version 3.2//EN">

Next month we'll look at Annex L and at Web SGML's future in the ISO standards process and in popular SGML software products.


