David Burger's Courtesy Flush: 08/10/08

Sunday, August 10, 2008

Converting from REXML to LibXML-Ruby

With the recent resurrection of LibXML-Ruby, I decided to investigate converting one of our more XML-processing-heavy applications from REXML to LibXML-Ruby. LibXML-Ruby is touted to be much faster than REXML, and I found this to be the case. In the process I kept track of some of the differences between the two, which should help you if you decide to do the same. Here are some of the command equivalents between the two:

Task	REXML	LibXML-Ruby
create doc from filename	REXML::Document.new(File.new(filename))	XML::Document.file(filename)
create doc from file	REXML::Document.new(file)	XML::Parser.io(file).parse
grab the root of a doc	doc.root	doc.root
create doc from string	REXML::Document.new(string)	parser = XML::Parser.new; parser.string = string; doc = parser.parse (no, you can pass a StringIO to XML::Parser.io)
return all elements (not text nodes)	node.elements	node.find('*')
xpath from element	elem[xpath] (annoyingly can return a single item or an array)	elem.find(xpath)
xpath to the first match	REXML::XPath.first(elem, xpath)	elem.find_first(xpath)
grab text content of node	elem.text	elem.content
working with attributes	elem.attributes[attr] (elem[...] reserved for xpath)	elem[attr]
creating nodes	REXML::Element.new(name); REXML::Element.new(name, attr_hash)	XML::Node.new(name); XML::Node.new(name, content) (can't set attrs on create)
deep clone a node	elem.deep_clone	elem.copy(true)
add a child element	node.elements.add(child) (child can be node or string); node.elements.add_element(name, attr_hash)	node << child_node; node.child = child.node; node.child_add(child)
removing elements	parent.remove(child); parent.delete_element(child) (child may be Element, String, or Integer)	child.remove!
jump to the next sibling	elem.next_element	elem.next
can XPath node not in a document?	yes	no
can add node directly from one document to another	yes	no
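
To make a few of those rows concrete, here is a minimal sketch that exercises the REXML side and notes the LibXML-Ruby counterpart from the table in a comment on each line. Only the REXML calls are executed; the sample XML and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample document, just enough to exercise the calls.
xml = '<inventory><item id="a1">first</item><item id="a2">second</item></inventory>'

doc  = REXML::Document.new(xml)           # LibXML-Ruby: set parser.string, then parser.parse
root = doc.root                           # LibXML-Ruby: doc.root

first      = REXML::XPath.first(root, 'item')  # LibXML-Ruby: root.find_first('item')
first_text = first.text                        # LibXML-Ruby: first.content
first_id   = first.attributes['id']            # LibXML-Ruby: first['id']
second     = first.next_element                # LibXML-Ruby: first.next

copy = first.deep_clone                   # LibXML-Ruby: first.copy(true)
root.delete_element(second)               # LibXML-Ruby: second.remove!
child_count = root.elements.size          # LibXML-Ruby: count the results of root.find('*')
```

The deep clone is detached from the document (its parent is nil), which matters on the LibXML-Ruby side given the last two rows of the table.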

OK, for those of you who actually read all the way through the table, the bonus is right down here: the biggest difference between REXML and LibXML-Ruby is in the handling of default namespaces. A default namespace is a namespace declared on an XML document without a prefix, so it applies to every unprefixed element within its scope. A good example of this is KML documents, which are often defined like this:
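
As a rough illustration, here is a trimmed-down KML document held in a Ruby string and loaded with REXML. The Placemark content is invented, and the namespace URI shown is the OGC KML 2.2 one; older files may declare a different URI. The point is that the `xmlns` attribute carries no prefix, yet every element underneath ends up in that namespace.

```ruby
require 'rexml/document'

# Assumed namespace URI (OGC KML 2.2); real documents may use another one.
kml_ns = 'http://www.opengis.net/kml/2.2'

# Hypothetical KML document: xmlns without a prefix declares the default namespace.
kml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <kml xmlns="#{kml_ns}">
    <Placemark>
      <name>Example</name>
    </Placemark>
  </kml>
XML

doc       = REXML::Document.new(kml)
placemark = doc.root.elements[1]      # first child element (REXML indexes from 1)

root_ns      = doc.root.namespace     # namespace URI of <kml>
placemark_ns = placemark.namespace    # inherited from the root, no prefix needed
```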

With REXML, you can write XPath expressions on the assumption that unprefixed names refer to the default namespace, and they will just work, no prefix necessary. With LibXML-Ruby, this is not the case. Say you have a reference to a node and you want to run some XPath on it. With LibXML-Ruby you will be forced to do something like this:

I found an approach of registering a prefix for the default namespace on Bogle's Blog. While this is nice, you still can't register this once for the whole document, but must do it on each node you will be running an XPath expression on.

(On another note, did I just remove all carriage returns from my table to make blogger happy? Why yes, yes I did.)