Interoperable HTML Parsing in IE9

SGT Oddball · Sep 13, 2010

The HTML parser is an important part of how we deliver on same markup because it plays a vital role in how the DOM is constructed. Therefore, it also plays a big role in how any DOM API or CSS rule is applied. While we’ve talked a lot about some of the high-profile API improvements in IE9 – getElementsByClassName, addEventListener, and so on – one important improvement we haven’t talked about is the HTML parser.

This is clearly important for developers, so we made interoperability improvements to our HTML parser in IE9 Standards Mode. This blog post provides practical guidance on how these improvements affect your site and how to avoid pitfalls in areas where all browsers still don’t behave the same way.

innerHTML

Originally introduced as IE-proprietary APIs, innerHTML and outerHTML have gained some early traction as standards and are widely implemented by other browsers, but with some differences. These properties are unusual among DOM APIs in that setting them invokes the parser. In IE9 we made changes to address the most common interoperability issues.

Much of the work we did here was simplifying our behavior internally. Prior to IE9, we took whatever input was passed to innerHTML/outerHTML and treated it as if it were the only content in an otherwise blank page (resulting in an implicit <html>, <head>, <body>, etc.). We then attempted to merge this page back into the calling element, which sometimes resulted in an “Unknown Runtime Error.”

In IE9, we improved the behavior to support more cases while removing all occurrences of “Unknown Runtime Error.” In cases that still don’t work, you’ll get a descriptive DOMException instead.
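As a rough sketch of how a script might cope with the cases that still fail, the helper below wraps an innerHTML assignment in try/catch. The name trySetInnerHTML and the shape of its return value are illustrative assumptions, not part of any browser API, and which inputs actually throw varies by browser.

```javascript
// Hypothetical helper (not a browser API): attempts an innerHTML assignment
// and reports failure instead of letting the exception propagate.
function trySetInnerHTML(element, markup) {
  try {
    element.innerHTML = markup;
    return { ok: true, error: null };
  } catch (err) {
    // Per the post, IE9 Standards Mode raises a descriptive DOMException here;
    // earlier versions could raise "Unknown Runtime Error" instead.
    return { ok: false, error: err };
  }
}

// Browser-only usage; skipped when no document is available.
if (typeof document !== 'undefined') {
  var target = document.createElement('div');
  var result = trySetInnerHTML(target, '<p>hello</p>');
  if (!result.ok && typeof console !== 'undefined') {
    console.log('innerHTML failed: ' + result.error.message);
  }
}
```

Returning the error rather than rethrowing lets the caller decide whether to fall back to another way of building the content.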

While the mainstream scenarios work pretty well across browsers, these APIs are still evolving and interop isn’t perfect in every case. For example, the following has different results in different browsers:

var img = document.getElementsByTagName('img')[0];
img.innerHTML = "image text";

The <img> element can’t have children, so the above doesn’t work in Chrome, Safari, or IE8, and has different behavior altogether in FF3.6. In IE9, Opera, and FF4 Beta, cases like this work as expected, and the text node is inserted properly.

In order to avoid problems with innerHTML, it’s a good idea to only feed it markup that can stand on its own. For example, calling div.innerHTML = "<p></p>" is fine, because <div> and <p> can exist without each other.

For small edits, you can also use DOM Core APIs like appendChild.
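Here is a minimal sketch of that DOM Core approach. The helper name appendTextElement and its parameters are assumptions made for illustration; only createElement, createTextNode, and appendChild are standard calls.

```javascript
// Hypothetical helper: appends a new element containing plain text to a parent
// using DOM Core calls only, so no markup string is parsed.
function appendTextElement(doc, parent, tagName, text) {
  var el = doc.createElement(tagName);
  el.appendChild(doc.createTextNode(text));
  parent.appendChild(el);
  return el;
}

// Browser-only usage; skipped when no document body is available.
if (typeof document !== 'undefined' && document.body) {
  appendTextElement(document, document.body, 'p', 'Added without innerHTML');
}
```

Because the text goes through createTextNode, characters such as < and & are inserted literally rather than being interpreted as markup.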

Generic Elements

One request from developers is better support for generic elements. A generic element has the same syntax as any other element, but a tag name that isn’t defined in HTML. IE9 Standards Mode follows the HTML5 spec and treats generic elements much like <span> elements. This means you can add more descriptive tag names to your page and style them as you would any other element:

IE9
This allows you to semantically describe the content of your page without losing any of the power you have with normal elements, using the same code as you would in other browsers.

Whitespace

One change that affects almost every page is how we parse whitespace. While IE8 removes or collapses whitespace, IE9 persists all whitespace into the DOM at parse-time. So the following markup (with a tab character between “IE” and “9”):

<div>
<span>IE	9</span>
</div>

Was represented in the IE8 DOM as:

div
|-> span
|---> "IE 9"

And is represented in the IE9 DOM as:

div
|-> "\n"
|-> span
|---> "IE\t9"
|-> "\n"

If your site depends on the existence or non-existence of whitespace, this change has substantial impact. The document structure will contain far more whitespace nodes, so APIs like firstChild might not reference the same node they used to. Another consideration is text node length. Because whitespace is now preserved within text nodes, the character index within a string might be different from what you’re expecting.

IE9’s behavior matches the HTML5 spec and interoperates with other browsers. There are ways this behavior can make your page more fragile, depending on how you use whitespace in your markup. Here are a few suggestions for avoiding these problems:

For scenarios where you just want elements, use the Element Traversal APIs – properties such as firstElementChild ensure you don’t reference a stray newline character by mistake.

For scenarios where you need more than just elements, like text nodes, use explicit type-checking via nodeType or a similar API. Depending on why you’re accessing individual characters in a text node, the split() method on JavaScript’s String object could be quite useful for isolating the parts of a string you want to examine.
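The two suggestions above can be sketched roughly as follows. All three helper names are hypothetical; the fallback loop assumes only the standard firstChild, nextSibling, nodeType, and nodeValue properties.

```javascript
var ELEMENT_NODE = 1; // value of Node.ELEMENT_NODE
var TEXT_NODE = 3;    // value of Node.TEXT_NODE

// Returns the first child that is an element, skipping whitespace text nodes.
// Uses firstElementChild when the browser provides it.
function firstElementChildOf(parent) {
  if (parent.firstElementChild !== undefined) {
    return parent.firstElementChild;
  }
  for (var node = parent.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === ELEMENT_NODE) {
      return node;
    }
  }
  return null;
}

// Returns the child text nodes that contain something other than whitespace.
function significantTextNodes(parent) {
  var result = [];
  for (var node = parent.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === TEXT_NODE && /\S/.test(node.nodeValue)) {
      result.push(node);
    }
  }
  return result;
}

// Splits text on runs of whitespace, so "IE\t9" and "IE 9" yield the same words.
function wordsOf(text) {
  return text.split(/\s+/).filter(function (word) {
    return word.length > 0;
  });
}
```

Comparing words rather than character offsets is one way to keep script logic independent of whether whitespace was collapsed or preserved.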

Overlapping Tags

As web developers, we don’t like to admit it, but we’ve probably all written the following markup at some point:

important text

Overlapped tags are a far more common occurrence than you might think, partly because they’re not always as obvious as the example above. Take the markup below:

<p><b><div>text</div></b></p>

The <p> element can’t legally contain a <div>, so IE, Firefox, Chrome, and Safari implicitly close the <p>. It’s almost as if you’d given this markup to the parser instead:

<p><b></p><div>text</div></b></p>

Notice that you didn’t even have to overlap your tags to end up in an overlapping tags scenario (the <b> element, in this case). This is just one edge case -- as you explore more scenarios, you’ll find that they can get pretty complex.

If you open up the IE8 Developer Tools to inspect the markup above, you’ll see this structure:

p
|-> b
|---> div
|-----> "text"
div
|-> "text"
p

It seems reasonable enough, but there’s actually more going on beneath the surface. In previous versions of IE, we persisted the overlapped markup more or less as written – meaning an overlapped element could occupy more than one position in the DOM tree.

This state – called an inclusion – occasionally leads to behavior differences across browsers, especially when using script to walk the tree. For example, calling nextSibling on the <b> element above will return the second <div>, and calling firstChild will return null. This occurs in spite of the fact that the <b> element appears to be the parent of a <div> and to have no siblings.

We improved IE9 mode to resolve such situations at parse-time to avoid these side-effects. In any place where earlier versions of IE would create an inclusion, IE9 creates a clone of the element instead.

So the markup from the example above would exist in the IE9 DOM as:

p
|-> b
b
|---> div
|-----> "text"
p

IE9 clones the <b> element when it sees the implicit </p> end tag. Thus, the DOM contains two distinct <b> elements, matching Chrome and Safari in this case. The HTML5 algorithm (supported by FF4 Beta) differs in that it clones overlapped elements upon encountering the next text node – resulting in a slightly different DOM structure for the example above.

To avoid these problems in the first place, it’s a good idea to run your markup through the W3C’s online validator, which helps spot them before they become real bugs. For convenience, IE’s F12 Developer Tools have a built-in link to pass a site through the W3C’s validator.
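For a quick local check on explicitly overlapped tags, something like the sketch below can flag end tags that don’t match the most recently opened element. It is not a substitute for the validator: the function name findMisnestedEndTags is hypothetical, the scanner is deliberately simplified (it ignores comments and script content), and it knows nothing about the implicit-close rules described above.

```javascript
// Elements that never take an end tag, so they are not tracked as open.
var VOID_ELEMENTS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                     'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Hypothetical helper: returns one entry per end tag that does not match the
// most recently opened element. Simplified scanner, not a real HTML parser.
function findMisnestedEndTags(markup) {
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  var open = [];     // stack of currently open tag names
  var problems = [];
  var match;
  while ((match = tagPattern.exec(markup)) !== null) {
    var isEndTag = match[1] === '/';
    var name = match[2].toLowerCase();
    if (!isEndTag) {
      if (VOID_ELEMENTS.indexOf(name) === -1) {
        open.push(name);
      }
    } else if (open.length > 0 && open[open.length - 1] === name) {
      open.pop();
    } else {
      problems.push({ tag: name, index: match.index });
      // Recover: if this element is open further down the stack,
      // drop it and everything opened after it.
      var position = open.lastIndexOf(name);
      if (position !== -1) {
        open.length = position;
      }
    }
  }
  return problems;
}
```

With an illustrative input such as <b><i>important text</b></i>, the sketch reports both the overlapping </b> and the </i> that is left without a matching open element.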

Title Element

In IE8 and earlier versions of IE, the parser implicitly creates a <title> element whenever it encounters a <head>. As a result, developers targeting IE8 can assume that head.firstChild returns a <title> element, even if they don’t explicitly declare one in their markup.

In IE9, we made an interoperability change to respect the <title> element’s position in the <head>, like other major browsers.

Much like whitespace handling, this could result in your site behaving differently in IE9 than in previous versions of IE if you write applications that depend on the first child of your <head> element always being <title>.

If you need to grab the title, a better approach would be getElementsByTagName.
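A minimal sketch of that lookup follows; the helper name getTitleElement is an assumption for illustration, and it takes the document as a parameter rather than relying on head.firstChild.

```javascript
// Hypothetical helper: finds the title element by tag name instead of
// assuming it is the first child of <head>.
function getTitleElement(doc) {
  var titles = doc.getElementsByTagName('title');
  return titles.length > 0 ? titles[0] : null;
}

// Browser-only usage; skipped when no document is available.
if (typeof document !== 'undefined') {
  var titleElement = getTitleElement(document);
  if (titleElement === null && typeof console !== 'undefined') {
    console.log('This page does not declare a title element.');
  }
}
```

If only the title text is needed, the standard document.title property avoids touching the element at all.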

Object Element

Historically, the <object> element’s behavior in IE has been rather idiosyncratic, largely due to the fact that web sites often use it to interface with native code running outside the browser sandbox. In IE9, we’ve improved parsing so it and its contents appear in the DOM like any other element.

For example, any <param> elements or fallback content inside the <object> will be persisted in the DOM, regardless of whether the <object> successfully loads.

This means that calls like the following will now work:

alert(document.getElementsByTagName('param')[0].nodeName)

You shouldn’t have to do anything special to take advantage of our new behavior – but you can now interact with <object> much like you can most other elements and like you can in other browsers.
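As one illustration of what this enables, the sketch below collects the name/value pairs of the <param> elements inside an <object>. The helper name readObjectParams is hypothetical; it relies only on the standard getElementsByTagName and getAttribute calls.

```javascript
// Hypothetical helper: returns a plain object mapping each <param> name to its
// value. getElementsByTagName also matches <param> elements nested in inner
// <object> fallbacks, so later entries with the same name overwrite earlier ones.
function readObjectParams(objectElement) {
  var params = objectElement.getElementsByTagName('param');
  var result = {};
  for (var i = 0; i < params.length; i++) {
    var name = params[i].getAttribute('name');
    if (name !== null) {
      result[name] = params[i].getAttribute('value');
    }
  }
  return result;
}

// Browser-only usage; skipped when no document is available.
if (typeof document !== 'undefined') {
  var objects = document.getElementsByTagName('object');
  if (objects.length > 0 && typeof console !== 'undefined') {
    console.log(readObjectParams(objects[0]));
  }
}
```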

While these changes may seem less important than adding or changing an API, the impact on real web development is substantial. If you’re a developer, try your site in the latest Platform Preview and look for any problems resulting from the changes above.

As always, please send your feedback via Connect or the comments section.

Thanks!
Jonathan Seitel
Program Manager
