xml

Namespaces: Obfuscating Xml for fun and profit

One reason Xml is hated by many is namespaces. While the concept is incredibly useful and powerful, the implementation, imho, is a prime example of over-engineered flexibility: it's so flexible that you can express the same document in a number of radically different ways that are difficult to tell apart with the naked eye. This flexibility then becomes the downfall of many users, and of simplistic parsers, once they try to write XPath rather than walking the tree and looking at local names.

Making namespaces confusing

Conceptually, it seems very useful to be able to specify a namespace for an element so that documents from different authors can be merged without collision and ambiguity. And if this declaration were a simple unique map from prefix to Uri, it would be a useful system. You see a prefix, you know it has a namespace that was defined somewhere earlier in the document. Ok, it could also be defined in the same node, and that's confusing already.

But that's not how namespaces work. In order to maximize flexibility, there are a number of aspects to namespacing that can make them ambiguous to the eye. Here are what I consider the biggest culprits in muddying the waters of understanding:

Prefix names are NOT significant

Let's start with a common misconception that sets the stage for most comprehension failures that follow, i.e. that the prefix of an element has some unique meaning. The two snippets below are identical in meaning:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <b>foo</b>
  </xsl:template>
</xsl:stylesheet>

<a:stylesheet version="1.0" xmlns:a="http://www.w3.org/1999/XSL/Transform">
  <a:template match="/">
    <b>foo</b>
  </a:template>
</a:stylesheet>

The prefix is just a short alias for the namespace uri. I chose xsl because certain prefixes like xsl, xhtml, dc, etc. are used so consistently with their namespace uris that a lot of people assume the name is significant. But it isn't. Someone may give you a document with their favorite prefix and at first look, you'd think the xml is invalid.
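If you'd rather check than take my word for it, here's a minimal sketch in Python using the standard library's xml.etree.ElementTree (my pick for illustration only). It parses both snippets and compares the fully qualified names the parser actually sees; the variable and function names are made up for this example.

```python
import xml.etree.ElementTree as ET

doc_xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <b>foo</b>
  </xsl:template>
</xsl:stylesheet>"""

doc_a = """<a:stylesheet version="1.0" xmlns:a="http://www.w3.org/1999/XSL/Transform">
  <a:template match="/">
    <b>foo</b>
  </a:template>
</a:stylesheet>"""


def qualified_names(xml_text):
    """Return every element name as the parser sees it: {namespace-uri}localname."""
    return [elem.tag for elem in ET.fromstring(xml_text).iter()]


names_xsl = qualified_names(doc_xsl)
names_a = qualified_names(doc_a)

# The prefixes are gone after parsing; only uri + local name remain.
for name in names_xsl:
    print(name)
print("same document?", names_xsl == names_a)
```

Once parsed, neither xsl nor a shows up anywhere, which is the whole point: the prefix never was part of the name.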

Default Namespaces

Paradoxically, default namespaces likely came about to make namespacing easier and encourage its use. If you want your document to not conflict with anything else, it's best to declare a namespace:

<my:a xmlns:my="ns1">
  <my:b>blah</my:b>
</my:a>

But that's just tedious. I just want to say "assume that everything in my document is in my namespace":

<a xmlns="ns1">
  <b>blah</b>
</a>

Beautiful. I love default namespaces!

Ah, but wait, there's more! A default namespace can be declared on any element and governs all its children. Yep, you can override previous defaults and elements at the same hierarchy level could have different namespaces without looking different:

<a xmlns="ns1">
  <b xmlns="ns2">
    <c>blah</c>
  </b>
  <b xmlns="ns3">
    <c>blah</c>
  </b>
</a>

Here it looks like we have an element a with two child elements b, each with an element c. Except not only is the first b really {ns2}b and the second b {ns3}b, but even worse, the c elements, which have no namespace declaration, are also different, i.e. {ns2}c and {ns3}c. This smells of someone being clever. It looks like a feature serving readability when it does exactly the opposite. Use this in larger documents with some more nesting and the only way you can determine whether and what namespace an element belongs to is to use a parser. And that defeats the human readability property of Xml.
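Since "use a parser" really is the only reliable answer here, this is roughly what that looks like. A small Python sketch with xml.etree.ElementTree (again just my choice of tool) that lists the qualified name of every element in the example above:

```python
import xml.etree.ElementTree as ET

doc = """<a xmlns="ns1">
  <b xmlns="ns2">
    <c>blah</c>
  </b>
  <b xmlns="ns3">
    <c>blah</c>
  </b>
</a>"""

# Each tag comes back as {namespace-uri}localname, so the inherited
# default namespace of every element becomes visible.
tags = [elem.tag for elem in ET.fromstring(doc).iter()]
for tag in tags:
    print(tag)
```

The two c elements look identical in the text but come back with different names.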

Attributes do not inherit the default namespace

As if default namespaces didn't provide enough obfuscation power, there is a special exception to them and that's attributes:

<a xmlns="ns1">
  <b c="who am i">blah</b>
</a>

So you'd think this is equivalent to:

<x:a xmlns:x="ns1">
  <x:b x:c="who am i">blah</x:b>
</x:a>

But you'd be wrong. @c isn't @x:c, it's just @c. It's without namespace. The logic goes like this: Namespaces exist to uniquely identify nodes. Since an attribute is already inside a uniquely identifiable container, the element, it doesn't need a namespace. The only way to get a namespace on an attribute is to use an explicit prefix. Which means that if you wanted @c to be in the namespace {ns1}, but not force every element to declare the prefix as well, you'd have to write it like this:

<a xmlns="ns1">
  <b x:c="who am i" xmlns:x="ns1">blah</b>
</a>

Oh yeah, much more readable. Thanks for that exception to the rule.
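To see the exception in action, here's a quick Python sketch using xml.etree.ElementTree. I've put both flavors of attribute on the same element; the second attribute d is something I added purely for the comparison.

```python
import xml.etree.ElementTree as ET

doc = """<a xmlns="ns1">
  <b c="who am i" x:d="me too" xmlns:x="ns1">blah</b>
</a>"""

b = ET.fromstring(doc)[0]

# The element picks up the default namespace...
print(b.tag)

# ...but the unprefixed attribute does not; only the prefixed one is qualified.
attr_names = sorted(b.attrib)
print(attr_names)
```

The element b is in {ns1}, the attribute c is in no namespace at all, and only the explicitly prefixed d ends up as {ns1}d.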

Namespace prefixes are not unique

That last example is a perfect segue into the last, oh, my god, seriously?, obfuscation of namespacing: You can declare the same namespace multiple times with different prefixes and, even more confusingly, you can define the same prefix with different namespaces.

<x:a xmlns:x="ns1">
  <x:b xmlns:x="ns2">
    <x:c xmlns:x="ns1">you don't say</x:c>
  </x:b>
  <y:b xmlns:y="ns1">
    why would you do this?
  </y:b>
</x:a>

Yes, that is legal AND completely incomprehensible. And yes, people aren't likely to do this on purpose, unless they really are sadists. But I've come across equivalent scenarios where multiple documents were merged together without paying attention to existing namespaces. In fairness, trying to understand existing namespaces on merge is a pain, so it might have been purely done in self-defense. This is the equivalent of spaghetti code and it's enabled by needless flexibility in the namespace system.
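For the record, this is what a parser makes of that document. A minimal Python sketch with xml.etree.ElementTree that prints each element's qualified name:

```python
import xml.etree.ElementTree as ET

doc = """<x:a xmlns:x="ns1">
  <x:b xmlns:x="ns2">
    <x:c xmlns:x="ns1">you don't say</x:c>
  </x:b>
  <y:b xmlns:y="ns1">
    why would you do this?
  </y:b>
</x:a>"""

# x means ns1, then ns2, then ns1 again; y also means ns1.
resolved = [elem.tag for elem in ET.fromstring(doc).iter()]
for tag in resolved:
    print(tag)
```

Note that x:b and y:b, which look different, are {ns2}b and {ns1}b, while x:a and y:b, which also look different, share a namespace.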

XPath needs unambiguous names

So far I've only addressed the ambiguity in authoring and in visually parsing namespaced Xml, which has plenty of pain points just in itself. But now let's try to find something in one of these documents.

<x:a xmlns:x="ns1">
  <x:b xmlns:x="ns2">
    <x:c xmlns:x="ns1">you don't say</x:c>
  </x:b>
  <y:b xmlns:y="ns1">
    why would you do this?
  </y:b>
</x:a>

Let's get the c element with this xpath:

/x:a/x:b/x:c

But that doesn't return any results. Why not? The main thing to remember with XPath is that, again, prefixes are NOT significant. That means, just because you see a prefix used in the document doesn't actually mean that XPath can find it by that name. Again, why not? Indeed. After all, the x prefix is defined, so why can't XPath just use that mapping? Well, remember that in this example, x means something different depending on where you are in the document. XPath doesn't work contextually, it needs unique names to match. Internally, XPath needs to be able to convert the element names into fully qualified names before ever looking at the document. That means what it really wants is a query like this:

/{ns1}a/{ns2}b/{ns1}c

Since namespaces can be used in all sorts of screwy ways to make the same prefix mean different things contextually, the prefixes seen in the text representation of the document are useless to XPath. Instead, you need to define manual, unique mappings from prefix to namespace, i.e. you need to provide a unique lookup from prefix to uri. Gee, unique prefix... Why couldn't the Xml document spec for namespaces have respected that requirement as well?
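Here's what supplying that lookup looks like in practice. This is a sketch in Python with xml.etree.ElementTree, which only supports a subset of XPath and evaluates paths relative to an element, so the path starts below the root a. The prefixes one and two are names I made up; they exist only in the query's mapping, not in the document.

```python
import xml.etree.ElementTree as ET

doc = """<x:a xmlns:x="ns1">
  <x:b xmlns:x="ns2">
    <x:c xmlns:x="ns1">you don't say</x:c>
  </x:b>
  <y:b xmlns:y="ns1">
    why would you do this?
  </y:b>
</x:a>"""

root = ET.fromstring(doc)

# A unique prefix-to-uri lookup, defined by the query, not by the document.
query_prefixes = {"one": "ns1", "two": "ns2"}
hit = root.find("two:b/one:c", query_prefixes)
print(hit.text)

# Reusing the document's x prefix with a single meaning finds nothing:
# x:b resolves to {ns1}b (the element written as y:b), which has no c child.
miss = root.find("x:b/x:c", {"x": "ns1"})
print(miss)
```

The query works as soon as each prefix means exactly one thing, regardless of what the document happened to call its namespaces.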

Namespace peace of mind: Be explicit and unique

The best you can do to keep namespacing nightmares at bay is to follow 2 simple rules for formatting and ingesting Xml:

  1. Only use default namespacing on the root node
  2. Keep your prefixes unique (preferably across all documents you touch)

There, done, ambiguity is gone. Now make sure you normalize every Xml document that passes through your hands by these rules and bathe in the light of transparency. It's easier to read, and you can initialize XPath with that global nametable of yours so that your XPath representation will match your rendered Xml representation.
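One possible way to do that normalization, sketched in Python with xml.etree.ElementTree: register your global prefix table once, then parse and re-serialize whatever comes in. The table contents (one, two) are hypothetical, and note that register_namespace changes module-wide state. ElementTree also writes explicit prefixes everywhere instead of a default namespace on the root, so this covers rule 2 and sidesteps rule 1 rather than following it to the letter.

```python
import xml.etree.ElementTree as ET

# Hypothetical global nametable: one unique prefix per namespace uri.
GLOBAL_PREFIXES = {"one": "ns1", "two": "ns2"}
for prefix, uri in GLOBAL_PREFIXES.items():
    ET.register_namespace(prefix, uri)


def normalize(xml_text):
    """Parse and re-serialize so every namespace uses its registered prefix."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


messy = """<x:a xmlns:x="ns1">
  <x:b xmlns:x="ns2">
    <x:c xmlns:x="ns1">you don't say</x:c>
  </x:b>
  <y:b xmlns:y="ns1">
    why would you do this?
  </y:b>
</x:a>"""

normalized = normalize(messy)
print(normalized)
```

After the round trip, all declarations sit on the root element and each prefix maps to exactly one uri, so the same table can be handed to your XPath queries.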

By arne | geek, rant

A case for XML

XML gets maligned a lot. It’s enterprisey, bloated, overly complex, etc. And the abuses visited upon it, like trying to express flow control or whole DSLs in it or being proposed as some sort of panacea for all interop problems only compound this perception. But as long as you treat it as what it is, data storage, I generally can find little justification to use something else. Not because it’s the best, but because it’s everywhere.

If you are your own consumer and you want a more efficient data storage, just go binary already. If you’re not, then I bet your data consumers are just tickled that they have to add another parser to their repository of data ingestors. James Clark probably put it best when he said:

“For the payload format, XML has to be the mainstay, not because it’s technically wonderful, but because of the extraordinary breadth of adoption that it has succeeded in achieving. This is where the JSON (or YAML) folks are really missing the point by proudly pointing to the technical advantages of their format: any damn fool could produce a better data format than XML.”

Ok, I won’t get religious on the subject, but mostly wanted to give a couple of examples, where the abilities and the adoption of XML have been a godsend for me. All this does assume you have a mature XML infrastructure. If you’re dealing with XML via SAX or even are doing the parsing and writing by hand, then you are in a world of hurt, I admit. But unless it’s a memory constraint there really is no reason to do that. Virtually every language has an XML DOM lib at this point.

I love namespaces

One feature a lot of people usually point to when they decry XML to me is namespaces. They can be tricky, I admit, and a lot of consumers of XML don’t handle them right, causing problems. Like Blend puking on namespaces that apparently weren’t hardcoded into its parser. But very simply, namespaces let you annotate an existing data format without messing with it.

<somedata droog:meta="some info about somedata">
  <droog:metablock>And a whole block of extra data</droog:metablock>
</somedata>

Here’s the scenario. I get data in XML and need to reference metadata for processing further down the pipeline. I could have ingested the XML and then written out my own data format. But that would mean I’d have to also do the reverse if I wanted to pass the data along or return it after some modifications and I have to define yet another data format. By creating my own namespace, I am able to annotate the existing data without affecting the source schema and I can simply strip out my namespace when passing the processed data along to someone else. Every data format should be so versatile.
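The "strip out my namespace" step is only a few lines with a DOM-style API. Here's a rough Python sketch using xml.etree.ElementTree; the namespace uri urn:example:droog, the payload element and the function name are all made up for illustration, since the snippet above doesn't show the real declaration.

```python
import xml.etree.ElementTree as ET

DROOG_NS = "urn:example:droog"  # hypothetical uri for the droog prefix


def strip_namespace(elem, ns):
    """Remove every attribute and child element belonging to namespace ns."""
    marker = "{" + ns + "}"
    for key in [k for k in elem.attrib if k.startswith(marker)]:
        del elem.attrib[key]
    for child in list(elem):
        if child.tag.startswith(marker):
            # Note: this also drops the whitespace/text that follows the child.
            elem.remove(child)
        else:
            strip_namespace(child, ns)


doc = f"""<somedata xmlns:droog="{DROOG_NS}" droog:meta="some info about somedata">
  <droog:metablock>And a whole block of extra data</droog:metablock>
  <payload>original content</payload>
</somedata>"""

root = ET.fromstring(doc)
strip_namespace(root, DROOG_NS)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

What comes out is the source format again, with no trace of the annotations and no changes to the elements that were there originally.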

Transformation, Part 1: Templating

When writing webapps, there are literally dozens of templating engines to choose from and new ones are constantly emerging. I chose to learn XSLT some years back because I liked how Cocoon and AxKit handled web pages. Just create your data in XML and then transform it using XSLT according to the delivery needs. So far, nothing especially unique compared to other templating engines. Except unlike most engines, it didn’t rely on some program creating the data and then invoking the templating code. XSLT works with dynamic apps as easily as with static XML or third-party XML.

Since those web site roots, I’ve had need for email templating and data transformation in .NET projects and was able to leverage the same XSLT knowledge. That means I don’t have to pick up yet another tool to do a familiar task just a little differently.

What’s the file format?

When I first started playing with Xaml, I was taking Live For Speed geometry data and wanted to render it in WPF and Silverlight. Sure, I had to learn the syntax of the geometry constructs, but I didn’t have to worry about figuring out the data format. I just used the more than familiar XmlDocument and was able to concentrate on geometry, not file formats.

Transformation, Part 2: Rewriting

Currently I’m working with Xaml again for a Silverlight project. My problem was that I had data visualization in Xaml format (coming out of Illustrator), as well as associated metadata (a database of context data), and I needed to attach the metadata to the geometry, along with behavior. Since the first two are output from other tools, I needed a process that could be automated. One way would be to walk the visual tree once loaded, create a parallel hierarchy of objects containing the metadata and behavior, and attach their behavior to the visual tree. But I’d rather have the data do this for itself.

<Canvas x:Name="rolloverContainer_1" Width="100" Height="100">
  <!-- Some geometry data -->
</Canvas>

<!-- becomes -->

<droog:RolloverContainer x:Name="rolloverContainer_1" Width="100" Height="100">
  <!-- Some geometry data -->
</droog:RolloverContainer>

So I created custom controls that subclassed the geometry content containers. I then created a post-processing script that simply loaded the Xaml into the DOM and rewrote the geometry containers as the appropriate custom controls using object naming as an identifying convention. Now the wiring happens automatically at load, courtesy of Silverlight. Again, no special parser required, just using the same XmlDocument class I’ve used for years.
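My actual script used XmlDocument, but the idea fits in a few lines of any DOM-ish library. Here's a rough Python sketch with xml.etree.ElementTree. Everything specific in it is an assumption for illustration: the two Xaml namespace uris are the WPF ones, urn:example:droog stands in for whatever the custom control namespace really maps to, and the naming convention is simply "x:Name starts with rolloverContainer_".

```python
import xml.etree.ElementTree as ET

# Assumed namespace uris, for illustration only.
XAML_NS = "http://schemas.microsoft.com/winfx/2006/xaml/presentation"
X_NS = "http://schemas.microsoft.com/winfx/2006/xaml"
DROOG_NS = "urn:example:droog"  # hypothetical custom control namespace

doc = f"""<Canvas xmlns="{XAML_NS}" xmlns:x="{X_NS}">
  <Canvas x:Name="rolloverContainer_1" Width="100" Height="100" />
  <Canvas x:Name="background" Width="100" Height="100" />
</Canvas>"""


def rewrite_containers(root):
    """Turn Canvas elements named rolloverContainer_* into RolloverContainer."""
    for elem in root.iter(f"{{{XAML_NS}}}Canvas"):
        name = elem.get(f"{{{X_NS}}}Name", "")
        if name.startswith("rolloverContainer_"):
            # Only the element name changes; attributes and children stay put.
            elem.tag = f"{{{DROOG_NS}}}RolloverContainer"


root = ET.fromstring(doc)
rewrite_containers(root)
print([child.tag for child in root])
```

The second Canvas is left alone because its name doesn't match the convention, which is all the "identifying convention" amounts to.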

And finally, Serialization

I use XML serialization for over the wire transfers as well as data and configuration storage. In all cases, it lets me simply define my DTOs and use them as part of my object hierarchy without ever having to worry about persistence. I just save my object graph by serializing it to XML and rebuild the graph by deserializing the stream again.

I admit that this last bit does depend on some language-dependent plumbing that’s not all that standard. In .NET, it’s built in and lets me mark up my objects with attributes. In Java, I use Simple for the same effect. Without this attribute-driven markup, I’d have to walk the DOM and build my objects by hand, which would be painful.
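To give a feel for what that plumbing buys you, here's a toy round trip in Python. It is not the .NET XmlSerializer or Simple, just a sketch of the same idea using dataclasses and xml.etree.ElementTree; the Settings DTO and both helper functions are hypothetical, and it only handles flat objects whose field types can be built from a string.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Settings:  # hypothetical DTO
    host: str
    port: int


def to_xml(obj):
    """Serialize a flat dataclass: one child element per field."""
    root = ET.Element(type(obj).__name__)
    for f in fields(obj):
        ET.SubElement(root, f.name).text = str(getattr(obj, f.name))
    return ET.tostring(root, encoding="unicode")


def from_xml(cls, text):
    """Rebuild the dataclass, converting each element's text via the field type."""
    root = ET.fromstring(text)
    return cls(**{f.name: f.type(root.findtext(f.name)) for f in fields(cls)})


original = Settings(host="localhost", port=8080)
as_xml = to_xml(original)
restored = from_xml(Settings, as_xml)
print(as_xml)
print(restored == original)
```

The field declarations play the role of the attributes: the object describes itself, and nobody walks the DOM by hand.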

Sure, for data, binary serialization would be cheaper and more compact, but that misses the other benefits I get for free. The data can be ingested and produced by a wide variety of other platforms, I can manually edit it, or easily build tools for editing and generation, without any specialized coding.

For my Silverlight project, I’m currently using JSON as my serialization layer between client and server, since there currently is no XmlSerializer or even XmlDocument in Silverlight 1.1. It, too, was painless to generate and ingest and, admittedly, much more compact. But then I added this bit to my DTO:

List<IContentContainer> Containers = new List<IContentContainer>();

It serialized just fine, but then on the other end it complained about there not being a no-argument constructor for IContentContainer. Ho Hum. Easily enough worked around for now, but I will be switching back to XML for this once Silverlight 2.0 fleshes out the framework. Worst case, I’ll have to build XmlSerializerLitem, or something like that, myself.

All in all, XML has allowed me to do a lot of data-related work without having to constantly worry about yet another file format or parser. It’s really not about being the best format, but about it being virtually everywhere and being supported with a mature toolchain across the vast majority of programming environments, and that pays a lot of dividends, imho.

By arne | geek, rant

Emacs, Nxml-Mode and Unicode

I’ve run into this too many times and fixed it just as many... grr. And I always forget... Time to write it down.

I use James Clark’s excellent nxml-mode to edit pretty much anything that’s vaguely XML, i.e. I usually convert HTML I have to edit into XHTML so I can use this mode.

Problem is, if you just start writing XML in that mode, the resulting file will be Unicode encoded. There is a fix to this. Write proper XML :) Basically, first add this to your .emacs file:

(unify-8859-on-decoding-mode)

Next, make sure you got your proper XML header:

<?xml version="1.0" encoding="utf-8"?>

Tada! File is saved in proper form.

Just to be all proper and stuff, I use this header for my html/xhtml:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

By arne | geek