Article from February, 1998.

An XML Glossary

By Bob DuCharme, XML Correspondent

Bob DuCharme is the author of "SGML CD," a tutorial and user's guide to free SGML software, available from Prentice Hall as part of Charles Goldfarb's series on Open Information Management. He also contributed to the "SGML Buyer's Guide" in the same series and to QUE Publishing's "Using SGML."

Abstract
A useful glossary of terms for the implementor of XML systems.

XML is a simplified version of SGML, but XML discussions use several terms unfamiliar to experienced SGML users. Some are brand new XML terms. Some are old SGML terms that rarely came up until recently because they describe some aspect of SGML/XML syntax that has received more attention with XML's increasing popularity; these are labeled "ISO 8879:1986" below. Others are new SGML terms used as part of the WG4's new annexes to ISO 8879 that make SGML a proper superset of XML. These are labeled "WG4" below.

character encoding

The association of a character with a numeric value and a bit pattern representation of that number. A given set of encodings represents each character with the same number of bits. For example, the seven-bit ASCII code used on PCs and Macs assigns a lower-case "a" to the value 97 and the bit pattern 1100001. Standards such as ISO 10646 can represent far more than the 128 characters encoded by ASCII, so they can assign values to characters from many international alphabets.

empty-element tag

An empty element has no content. XML represents it either with a start-tag immediately followed by an end-tag or with an empty-element tag, which looks like a start-tag with a slash before its closing ">":

<img src="klee.jpg"/>

That closing slash is one of the key innovations of XML. Now that an empty element looks different from a non-empty element's start-tag, applications can recognize empty elements without reading a DTD first. SGML doesn't use this closing slash in empty-element tags; in fact, it doesn't have the concept of an "empty-element tag," instead describing an empty element's markup as a start-tag without an end-tag (ISO 8879 7.3).

fully-tagged

(WG4) In a fully-tagged document instance, every element has a start-tag with a generic identifier, an end-tag, and an attribute name with every attribute specification. "Fully-tagged" is an SGML term, not an XML term, because XML's lack of support for tag minimization means that it takes this condition for granted; SGML uses the term to describe a particular subset of SGML documents that also meet a specific requirement of XML documents.

ISO 10646

An international standard for representing characters from alphabets used around the world. (See also: Unicode, UCS, and UTF)

name space

A set of names treated as a unit. The use of two overlapping name spaces results in a name space conflict, for example when a document type uses element types and entities declared in two different external files and one element type is declared differently in the two files. Theoretically, this is no more of an issue in XML than it is in SGML, but in practice, people planning XML systems have been more concerned about the problem of name space conflicts. The XML Working Group is drafting a paper that proposes a syntax to resolve this issue.

NESTC

(WG4) NET (Null End Tag)-Enabling Start-Tag Close. Closing a start-tag with this delimiter allows the use of a null end-tag. NESTC is defined as the slash ("/") character, and if the null end-tag is defined as ">", then an empty img element can be written like this:

<img src="bozo.jpg"/>

Note that this is actually two tags when you consider that "/" closes the start-tag and ">" is a null end-tag.

PIC

(ISO 8879:1986) The Processing Instruction Close delimiter. This was rarely an issue until recently because most people, instead of redefining it in their SGML declaration, left it set to the ">" character defined by the reference concrete syntax. In XML, "?>" is the PIC. As developers put more information into processing instructions (PIs), they can't use the ">" character in PI content if that character signifies the end of the PI, so the fact that XML PIs end with "?>" lets developers use ">" more freely in PIs. This is especially useful if the PI incorporates XML-like syntax; see "PI target" for an example.

PI target

After a Processing Instruction Open delimiter of "<?" (which XML leaves unchanged from SGML's reference concrete syntax setting), XML requires a processing instruction target, which names the application that the processing instruction is meant for. For example, let's say I have an application called soundlib that locates sound files. I want my XML application to pass it the parameters "a=440" and "Dolby=FALSE" before processing the document elements that use this program, so I might use this processing instruction:

<?soundlib <a>440</a> <dolby>FALSE</dolby> ?>

Note the use of XML-like syntax to pass the values for a and dolby to the soundlib program. Being within a processing instruction, these "elements" have nothing to do with the content models of the document actually being processed.

type-valid

(WG4) An SGML document whose instances conform to a declared DTD. Before ISO 8879 Annexes K and L, all SGML documents fell into this category; the expansion of SGML into a proper superset of XML means that these must now be distinguished from merely fully-tagged documents, which (under Annexes K and L) are considered conforming SGML documents without necessarily being type-valid.

Unicode

A standard for representing characters from the alphabets of the world. While this is a separate standard from ISO 10646, the Unicode Consortium and the ISO have worked to keep Unicode's related standards synchronized with the UCS-2 subset of ISO 10646.

UCS

Universal Multiple-Octet Coded Character Set, the ISO/IEC 10646 standards for representing international character sets. They use multiple octets (groups of eight bits) because the 256 possible values of a single octet, like the ASCII byte used on typical PCs, just aren't enough (especially when ASCII only uses seven of those eight bits). 10646's UCS-2 uses two octets per character and UCS-4 uses four. UCS-2 is potentially a proper subset of UCS-4, but to date no abstract characters have been assigned outside the UCS-2 range. See also: character encoding.

UTF-8, UTF-16

UCS Transformation Formats 8 and 16: specific ISO 10646 sets of character representations that use multiples of 8 and 16 bits for each character. In the words of the XML specification, "All XML processors must accept the UTF-8 and UTF-16 encodings of 10646." Technically, because UTF-8 and UTF-16 are variable-length representations, they are not "encodings" in the strict sense of the word.

valid

A valid XML document has an associated document type declaration and follows all the rules specified by that declaration. This may seem identical to the SGML sense of the term, but in XML, a document that isn't valid isn't necessarily garbage; it may be merely well-formed, and therefore still useful to many applications.

validity constraint, well-formedness constraint

In the XML specification, a condition accompanying a rule for creating a particular XML document component. This condition imposes further requirements to be met by the component being defined. If a parser is to consider the element or entity containing the XML component to be valid, that component must conform to any validity constraints shown; to be considered well-formed, it must meet any well-formedness constraints shown.

well-formed

The XML specification has many specific well-formedness constraints, but in general, a well-formed element, document, or entity is one whose component elements and entities are properly nested. Also, a well-formed entity has no partial elements, comments, processing instructions, or character or entity references; in other words, none of these begins in one entity and finishes in another.

(My thanks to Dave Peterson for his helpful reviews of some of these definitions.)
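
The difference between a well-formed document and a broken one can be tried out with a non-validating parser. The following is a minimal sketch, assuming Python's standard xml.dom.minidom module (built on the expat parser); the helper name is_well_formed and the sample strings are made up for illustration. Because this parser does not validate, the sketch says nothing about whether a document is valid, only whether it is well-formed.

```python
import xml.parsers.expat
from xml.dom import minidom


def is_well_formed(xml_text):
    """Return True if a non-validating parse of xml_text succeeds."""
    try:
        minidom.parseString(xml_text)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# Hypothetical sample documents; none of them has a DTD.
SAMPLES = {
    "empty-element tag": '<doc><img src="klee.jpg"/></doc>',
    "start-tag plus end-tag": '<doc><img src="klee.jpg"></img></doc>',
    "improperly nested": "<doc><b><i>text</b></i></doc>",
    "start-tag without end-tag": '<doc><img src="klee.jpg"></doc>',
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, is_well_formed(text))
```

In this sketch the first two samples parse successfully even though no DTD is present, while the last two are rejected: one because its elements are not properly nested, the other because it writes an empty element SGML-style, as a start-tag with no end-tag.
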


