Article from April 1998.

Standalone Documents

By Bob DuCharme, XML Correspondent

Bob DuCharme is the author of "SGML CD," a tutorial and user's guide to free SGML software, available from Prentice Hall as part of Charles Goldfarb's series on Open Information Management. He also contributed to the "SGML Buyer's Guide" in the same series and to QUE Publishing's "Using SGML."

Abstract
In this technical article, the author explores the use of the "standalone" attribute on the XML declaration.

Because XML is a simplified version of SGML, most of its markup is familiar to SGML people. Of the new markup designed specifically for XML, the best place to start is the XML declaration, a processing instruction that identifies the document as an XML document and indicates which version of the XML specification it conforms to. The following is a common first line in an XML document:

    <?xml version="1.0"?>

(Note that, unlike SGML processing instructions, XML PIs end with "?>"; by using this closing delimiter instead of ">", XML allows the use of the ">" character within the PI.)

The XML declaration may also include a standalone document declaration, which answers the question "can the processor use this document without paying attention to the external subset of markup declarations?"

    <?xml version="1.0" standalone="yes"?>

To briefly review the difference between an external subset and an internal subset: in the following example, the declarations between the square brackets are the internal subset. The declaration for the document's a element is presumably in the a.dtd file, which is the external subset.

    <?xml version="1.0"?>
    <!DOCTYPE b SYSTEM "a.dtd" [
    <!ELEMENT b (a,c)>
    <!ELEMENT c (#PCDATA)>
    ]>
    <b><a>like</a><c>wow</c></b>

Because a parser needs that a element type declaration to parse the document, this is not a standalone document. The document's XML declaration needs no standalone document declaration, because when an external subset is specified and the XML declaration omits the standalone document declaration, the parser assumes a standalone value of "no." When a document type declaration does not specify an external subset, the standalone document declaration is irrelevant, so you only need to include it when there is an external subset that the parser doesn't need.

How?

How do we know whether the parser needs the declarations in the external subset?
Section 2.9 of the XML 1.0 specification, "Standalone Document Declaration," lists four conditions.

First, a parser may need to look up default attribute values. For example, if the chapter element type has the following attribute list declaration,

    <!ATTLIST chapter flavor CDATA "mint">

then a chapter element in a document using this declaration has a flavor value of "mint" if its start tag doesn't explicitly specify a flavor value (<chapter>), because that's the default value. How does a processing application know this? From looking at the chapter element type's attribute list declaration, which specifies "mint" as the default value. If it must look outside the document in an external declaration subset to find this declaration, then it's not a standalone document.

Second, entities usually require access to declarations as well. (I say "usually" because the XML specification lists five predeclared entities for commonly needed characters: amp, lt, gt, apos, and quot for the ampersand, less-than, greater-than, apostrophe, and quotation mark characters.) If an XML processor finds the entity reference &yuzz; in a document, it needs access to the yuzz entity's declaration to see what it refers to. If it has to look outside the document in an external declaration subset to find this declaration, then it's not a standalone document.

The third condition that rules out standalone status is the use of attributes whose values are, in the words of the specification, "subject to normalization." Some attribute values can refer to entities, and normalization is essentially the process of resolving these references (and, for attribute types other than CDATA, of trimming and collapsing white space in the value). As with most other entity references, an XML processor needs the relevant declarations to carry this out, and a need to look outside the document at an external declaration subset means that the document is not a standalone one.
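The first two conditions can be observed with a parser that does not read the external subset. The following is a minimal sketch, assuming Python's standard xml.parsers.expat module used with its default settings; the helper name parse_summary, the file name chapter.dtd, and the entity's replacement text are made up for illustration.

```python
import xml.parsers.expat


def parse_summary(xml_text):
    """Return (attributes of the root start tag, all character data).

    Illustrative helper, not part of any library. The parser is left at its
    default settings, so it never fetches an external subset.
    """
    root_attrs = []
    text = []

    def start(name, attrs):
        # Record only the first (root) start tag.
        if not root_attrs:
            root_attrs.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = text.append
    parser.Parse(xml_text, True)
    return root_attrs[0], "".join(text)


# Declarations in the internal subset: the parser can see them.
INTERNAL = """<?xml version="1.0"?>
<!DOCTYPE chapter [
<!ATTLIST chapter flavor CDATA "mint">
<!ENTITY yuzz "replacement text">
]>
<chapter>&yuzz; &amp; more</chapter>"""

# Declarations assumed to live in a hypothetical chapter.dtd, which is never read.
EXTERNAL = """<?xml version="1.0"?>
<!DOCTYPE chapter SYSTEM "chapter.dtd">
<chapter>plain text</chapter>"""

internal_attrs, internal_text = parse_summary(INTERNAL)
external_attrs, external_text = parse_summary(EXTERNAL)

print(internal_attrs)  # {'flavor': 'mint'}: default supplied from the declaration
print(internal_text)   # replacement text & more
print(external_attrs)  # {}: the default in chapter.dtd is invisible
```

In the second document, an application relying on the flavor default would silently get no value, which is exactly the situation that a standalone value of "no" warns about.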
The fourth and final condition that can prevent a document from being standalone involves elements with element content. An element with element content has only other elements as children, with no character data of its own (that is, no PCDATA that is not part of a child element). The following thnad element does not have element content, because it has character data around its emph element:

    <thnad>This element is <emph>not</emph> element content.</thnad>

Any carriage returns, tabs, and other white space between the child elements of element content are only there to make it easier for people to read the marked-up data; they are not themselves data. Knowing this, would you consider the following yuzz element to have only element content?

    <yuzz>
    <wum>garlic</wum>
    <nuh>sapphires</nuh>
    <glikk>mud</glikk>
    </yuzz>

It's a trick question: you can't tell whether it's element content or mixed content without looking at its declaration. If it was declared with the following declaration, then it's element content, and the carriage returns between the child elements should be ignored:

    <!ELEMENT yuzz (wum|nuh|glikk)*>

If it was declared with the following declaration, however, that same yuzz element is still perfectly valid:

    <!ELEMENT yuzz (#PCDATA|wum|nuh|glikk)*>

Does the yuzz element have any PCDATA in it? If a document type declaration used that second declaration to define it, then it does: the carriage returns between the child elements qualify as PCDATA, not disposable formatting space, and the application needs to know about them. An XML processor passes along all white space to an application, but must still let the application know which white space is significant and, when possible, which is "pretty printing" white space.
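The dependence on the declaration can be sketched in a few lines. This is an illustration only, assuming Python's standard xml.parsers.expat module; direct_text is a made-up helper that keeps or drops the white space directly inside an element according to the content model it finds in the internal subset.

```python
import xml.parsers.expat
from xml.parsers.expat import model


def direct_text(xml_text, element_name):
    """Return the character data directly inside element_name that an
    application should treat as data (illustrative helper)."""
    allows_text = {}    # element type name -> True if character data is allowed
    open_elements = []  # names of the elements currently open
    kept = []

    def element_decl(name, content_model):
        # content_model is a (type, quantifier, name, children) tuple.
        allows_text[name] = content_model[0] in (
            model.XML_CTYPE_MIXED, model.XML_CTYPE_ANY)

    def start(name, attrs):
        open_elements.append(name)

    def end(name):
        open_elements.pop()

    def chars(data):
        if open_elements and open_elements[-1] == element_name:
            # With no declaration available, keep the white space: nothing
            # proves that it is only formatting.
            if allows_text.get(element_name, True) or data.strip():
                kept.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return "".join(kept)


TEMPLATE = """<?xml version="1.0"?>
<!DOCTYPE yuzz [
%s
]>
<yuzz>
<wum>garlic</wum>
<nuh>sapphires</nuh>
<glikk>mud</glikk>
</yuzz>"""

element_content = direct_text(
    TEMPLATE % "<!ELEMENT yuzz (wum|nuh|glikk)*>", "yuzz")
mixed_content = direct_text(
    TEMPLATE % "<!ELEMENT yuzz (#PCDATA|wum|nuh|glikk)*>", "yuzz")

print(repr(element_content))  # '': the line breaks were only formatting
print(repr(mixed_content))    # white space only, but kept as data
```

The two documents are identical below the document type declaration; only the declaration decides whether the line breaks reach the application as data.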
Because the processor needs access to the element type declaration to know whether it should treat such an element as having element content, if it needs to look outside the document at an external subset for that declaration, the document is not a standalone one.

What kind of external subset wouldn't have any declarations meeting these conditions? Anyone who's used a large, complex DTD such as DocBook or the Text Encoding Initiative DTD knows that a document doesn't have to use everything in the DTD. If it doesn't use any elements or entities declared in the external subset, it's a candidate for standalone status. One of XML's differences from SGML opens up another possibility: because XML lets you declare multiple attribute lists for the same element type, you could have one in a document's external subset and another in its internal subset. If none of the document's elements of a particular type specify values for the attributes declared in the external subset (and, as noted above, if none of those attributes have default values), then you may have a standalone document.

Why?

Why should a document be standalone? Because a self-contained document with no dependencies on other entities (for example, other files) is easier to send across a network with full confidence that nothing will go wrong when a browser or other application processes it at the client end. On the other hand, if a server must send out many documents that have large sections in common with each other, storing each of them as a standalone document is very inefficient. Would you really want to store all of the declarations from a given DTD in each document that uses those declarations?

Because standalone and non-standalone documents can each be so useful in certain situations, the XML specification stipulates that any standalone="no" document can be algorithmically converted to a standalone document. In other words, it can be mechanically converted with no human intervention.
This way, a server can assemble a standalone document from the necessary pieces (perhaps by merely copying external subset declarations to a document's internal subset) for shipping to a client. To allow the most efficient possible storage and transmission of XML documents, this conversion will be a standard feature of XML document servers.
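As a rough illustration of the "merely copying" approach, the sketch below merges an external subset into the internal subset with regular expressions. It is deliberately naive and is not how any particular server does it. It assumes a SYSTEM identifier (not PUBLIC), no "]" inside the internal subset, and an external subset free of conditional sections and of parameter-entity references inside declarations, both of which are legal only in an external subset. The names make_standalone and declared_standalone, and the content given for a.dtd, are assumptions made for the example.

```python
import re
import xml.parsers.expat

# Naive patterns; see the assumptions stated above.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s\[>]+)\s+SYSTEM\s+'
    r'(?P<q>["\'])(?P<sysid>.*?)(?P=q)\s*'
    r'(?:\[(?P<internal>.*?)\])?\s*>',
    re.DOTALL,
)
XML_DECL_RE = re.compile(r'^<\?xml(?=\s)(?P<body>.*?)\?>', re.DOTALL)
STANDALONE_RE = re.compile(r'\s+standalone\s*=\s*(["\']).*?\1')


def make_standalone(doc_text, external_subset):
    """Copy external_subset into the internal subset of doc_text."""
    match = DOCTYPE_RE.search(doc_text)
    if match is None:
        raise ValueError("no DOCTYPE with a SYSTEM identifier found")
    internal = (match.group("internal") or "").strip()
    # Internal declarations stay first so they keep precedence over the
    # external ones, as they had before the merge.
    merged = "\n".join(p for p in (internal, external_subset.strip()) if p)
    doctype = "<!DOCTYPE %s [\n%s\n]>" % (match.group("root"), merged)
    result = doc_text[:match.start()] + doctype + doc_text[match.end():]

    def declare(decl):
        # Drop any existing standalone declaration, then add standalone="yes".
        body = STANDALONE_RE.sub("", decl.group("body")).rstrip()
        return '<?xml%s standalone="yes"?>' % body

    return XML_DECL_RE.sub(declare, result, count=1)


def declared_standalone(xml_text):
    """Return 1 (yes), 0 (no), or -1 (no standalone declaration)."""
    flags = []
    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: flags.append(standalone))
    parser.Parse(xml_text, True)
    return flags[0] if flags else -1


DOC = """<?xml version="1.0"?>
<!DOCTYPE b SYSTEM "a.dtd" [
<!ELEMENT b (a,c)>
<!ELEMENT c (#PCDATA)>
]>
<b><a>like</a><c>wow</c></b>"""
A_DTD = "<!ELEMENT a (#PCDATA)>"  # assumed content of a.dtd

converted = make_standalone(DOC, A_DTD)
print(converted)
print(declared_standalone(converted))  # 1
```

A production converter would more likely work from the parser's own view of the declarations than from pattern matching, but the principle is the same: once every needed declaration is in the internal subset, nothing outside the document is required.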


