XML Encoding and Java

I came across a Java-based blog and noticed a post about XML processing. It appears that the community around that blog is not familiar with the character encoding policies of the XML parsers available in Java. The fact of the matter is that the XML specification has provisions for specifying the document encoding right in the document. It’s akin to HTML’s <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>, which causes user agents to restart the parsing of the HTML document containing this instruction with the provided character set in mind.

The Provisions

So what are these provisions? We’re talking about the XML declaration at the top of a document (often loosely called the xml processing instruction), which accepts an encoding pseudo-attribute, as in <?xml version="1.0" encoding="ISO-8859-1"?>. Similar to an HTML user agent, a SAX or DOM parser reading an XML document from a java.io.InputStream will restart parsing the XML document based on the declared character set.
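To make that concrete, here is a minimal sketch using the JAXP DOM parser that ships with the JDK. The class and method names (DeclaredEncodingDemo, rootText) and the sample document are made up for illustration. The bytes are written in ISO-8859-1, and nothing in the Java code tells the parser so; the only hint is the declaration inside the document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of any library.
final class DeclaredEncodingDemo {

    /** Parses raw bytes and returns the text of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            // No charset is passed here: the parser reads it from the XML declaration.
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><greeting>caf\u00e9</greeting>";
        // Encode the sample as ISO-8859-1, matching the declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1));
    }
}
```

In ISO-8859-1 the accented letter is the single byte 0xE9, which is not valid UTF-8 at that position, so the parse can only succeed if the parser honors the declared encoding.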

What does this mean for you? It means that you can relieve the application from the burden of maintaining knowledge about the encoding of XML documents, externalizing the knowledge into the XML documents themselves. And this works for Apache Digester, Apache Betwixt, and pretty much any tool building upon SAX or DOM.

And if we cannot guarantee the presence of the encoding in XML documents? Well, a non-normative section of the XML 1.0 specification (Appendix F, on autodetection of character encodings) states that the encoding or a family of encodings can still be detected reliably in various circumstances, by inspecting the first few bytes of a document. So then, a question is in order…
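The byte inspection can be sketched roughly as follows. This is a simplified illustration of the idea, not what any particular parser does internally: the class name EncodingSniffer and the returned labels are my own, and the UCS-4 cases from the specification are left out, so a UCS-4 little-endian document would be misreported as UTF-16LE here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a simplified take on the first-bytes inspection.
final class EncodingSniffer {

    /** Guesses an encoding family from the leading bytes of an XML document. */
    static String guessFamily(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";     // UTF-8 byte order mark
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";        // big-endian byte order mark
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";        // little-endian byte order mark
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without a mark
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without a mark
        if (startsWith(head, 0x4C, 0x6F, 0xA7, 0x94)) return "EBCDIC";   // "<?xm" in EBCDIC
        // "<?xm" in ASCII: the declaration is readable, so its encoding value decides.
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible";
        return "UTF-8"; // no mark and no declaration: XML's default
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><a/>";
        System.out.println(guessFamily(xml.getBytes(StandardCharsets.US_ASCII)));
        System.out.println(guessFamily(xml.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Note that for the ASCII-compatible family the sniffing only narrows things down far enough to read the declaration; the exact encoding still comes from the declaration itself.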

Which Way to Go?

So should we use java.io.Readers, or java.io.InputStreams as providers of XML data? If the XML is produced and consumed in a closed environment and its generation and encoding can be fully controlled, then you can use a java.io.Reader. Alternatively, you can ensure that an encoding is always specified within all these XML documents, enabling the use of java.io.InputStream.

If, on the other hand, you do not control the source of the XML documents, using a java.io.Reader has two notable consequences. Firstly, you must make a commitment about the encoding immediately. Secondly, the parser will not have access to raw bytes and will be unable to infer an encoding by inspecting bytes. In either case, you must enforce policies that guarantee the presence of encoding information, either in the XML document (which enables the use of java.io.InputStream) or external to the document (which requires the use of a java.io.Reader).
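The difference can be seen in a small experiment. The sketch below, again with made-up names (ReaderVersusStreamDemo, viaStream, viaReader), feeds the same UTF-8 bytes to the JDK's DOM parser twice: once as a byte stream, and once through a Reader that was deliberately created with the wrong charset. It assumes the JDK's bundled parser, which, as far as I can tell, does not re-decode a character stream based on the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
final class ReaderVersusStreamDemo {

    private static String parse(InputSource source) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    /** Byte stream: the parser decides the charset from the document itself. */
    static String viaStream(byte[] xml) throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(xml)));
    }

    /** Reader: the caller has already committed to a charset before parsing starts. */
    static String viaReader(byte[] xml, Charset committed) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), committed);
        return parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>caf\u00e9</greeting>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println("stream: " + viaStream(utf8));
        // Wrong commitment: the two UTF-8 bytes of the accented letter become two characters.
        System.out.println("reader: " + viaReader(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

With the Reader, the declaration still says UTF-8, but the decoding has already happened by the time the parser sees any characters, so the declared value has no effect on the result.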

Conflicts and Prioritization

Another non-normative section in the same XML 1.0 specification talks about the handling of multiple sources of encoding information. For example, an application may be provided with an XML document along with an encoding as part of a communications protocol, whereas the XML itself may also specify an encoding. The specification states that prioritization is up to the protocol used to deliver the XML. While I cannot see a reason not to always prefer an embedded encoding when present, this is the specification.

How do the Java XML APIs handle this situation? Well, the moment you provide a Reader to a parser, a commitment as to the encoding has already been made, hence you immediately choose the externally supplied encoding. Should you provide an InputStream and ignore the external encoding instead, you instruct the parser to honor the encoding supplied in the XML document. There doesn’t seem to be a way to instruct the parser to fall back to an external encoding if the XML document does not supply its own.
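Application code can approximate such a fallback by peeking at the start of the stream before handing it over. The following is only a rough sketch under several assumptions: the names (EncodingFallback, open, rootText) and the 256-byte peek window are arbitrary choices of mine, the check only recognizes declarations written in an ASCII-compatible encoding, and it needs Java 9 or later for readNBytes. If the document carries a byte order mark or an encoding declaration, the parser gets the raw bytes; otherwise the external encoding is applied through a Reader.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper; a workaround sketch, not a parser feature.
final class EncodingFallback {

    private static final int PEEK = 256; // assumed to be enough to cover the XML declaration
    private static final Pattern DECLARED = Pattern.compile("^<\\?xml[^>]*\\sencoding\\s*=");

    /** Prefers the document's own encoding information; falls back to the external one. */
    static InputSource open(InputStream raw, Charset external) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = in.readNBytes(head, 0, PEEK);
        in.reset(); // rewind so the parser sees the document from the start

        // ISO-8859-1 maps every byte to one char, so the ASCII prolog survives intact.
        String prolog = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        if (hasByteOrderMark(head, n) || DECLARED.matcher(prolog).find()) {
            return new InputSource(in); // let the parser honor the embedded information
        }
        return new InputSource(new InputStreamReader(in, external)); // commit to the external one
    }

    private static boolean hasByteOrderMark(byte[] b, int n) {
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return true; // UTF-8
        }
        if (n < 2) return false;
        int first = b[0] & 0xFF;
        int second = b[1] & 0xFF;
        return (first == 0xFE && second == 0xFF) || (first == 0xFF && second == 0xFE); // UTF-16
    }

    static String rootText(InputStream raw, Charset external) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(open(raw, external)).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No declaration in the document: the external ISO-8859-1 label is used.
        byte[] bare = "<greeting>caf\u00e9</greeting>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(new ByteArrayInputStream(bare), StandardCharsets.ISO_8859_1));
    }
}
```

This keeps the embedded encoding in charge whenever it is present, which matches the preference stated above, while still giving documents without one a defined interpretation.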

 
