Sunday, August 10, 2008

Converting from REXML to LibXML-Ruby

With the recent resurrection of LibXML-Ruby, I decided to investigate converting one of our more XML-heavy applications from REXML to LibXML-Ruby. LibXML-Ruby is touted to be much faster than REXML, and I found this to be the case. In the process I kept track of some of the differences between the two, which should help you if you decide to do the same. Here are some of the command equivalents between the two:

| Task | REXML | LibXML-Ruby |
|---|---|---|
| create doc from filename | REXML::Document.new(filename) | XML::Document.file(filename) |
| create doc from file | REXML::Document.new(file) | XML::Parser.io(file).parse |
| grab the root of a doc | doc.root | doc.root |
| create doc from string | REXML::Document.new(string) | parser = XML::Parser.new; parser.string = string; doc = parser.parse (no, you can pass a StringIO to XML::Parser.io) |
| return all elements (not text nodes) | node.elements | node.find('*') |
| xpath from element | elem[xpath] (annoyingly can return a single item or an array) | elem.find(xpath) |
| xpath to the first match | REXML::XPath.first(elem, xpath) | elem.find_first(xpath) |
| grab text content of node | elem.text | elem.content |
| working with attributes | elem.attributes[attr] (elem[...] reserved for xpath) | elem[attr] |
| creating nodes | REXML::Element.new(name); REXML::Element.new(name, attr_hash) | XML::Node.new(name); XML::Node.new(name, content) (can't set attrs on create) |
| deep clone a node | elem.deep_copy | elem.copy(true) |
| add a child element | node.elements.add(child) (child can be node or string); node.elements.add_element(name, attr_hash) | node << child_node; node.child = child.node; node.child_add(child) |
| removing elements | parent.remove(child); parent.delete_element(child) (child may be Element, String, or Integer) | child.remove! |
| jump to the next sibling | elem.next_element | elem.next |
| can XPath node not in a document? | yes | no |
| can add node directly from one document to another | yes | no |


OK, for those of you who actually read all the way through the table, the bonus is right down here: the biggest difference between REXML and LibXML-Ruby is in the handling of default namespaces. A default namespace is a namespace declared on an XML document without a prefix, so it applies to every unprefixed element within its scope. A good example of this is KML documents, whose root element usually declares one.
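
To make that concrete, here is a minimal sketch of a KML-style document with a default namespace, parsed with REXML. The placemark content and the KML 2.2 namespace URI are illustrative stand-ins, not taken from any particular file.

```ruby
require 'rexml/document'

# Illustrative KML-style document: the xmlns attribute on the root has no
# prefix, so it is the default namespace for every element below it.
kml = <<~KML
  <kml xmlns="http://www.opengis.net/kml/2.2">
    <Placemark>
      <name>Simple placemark</name>
    </Placemark>
  </kml>
KML

doc = REXML::Document.new(kml)

# The namespace URI that applies to the (unprefixed) root element.
default_ns = doc.root.namespace

# REXML lets unprefixed names in the XPath refer to the default namespace.
name = REXML::XPath.first(doc, '//Placemark/name').text

puts default_ns
puts name
```

Note that the XPath expression uses plain element names, with no prefix and no namespace argument.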



With REXML, you can write XPath expressions on the assumption that unprefixed names refer to the default namespace, and they will just work, no prefix necessary. With LibXML-Ruby, this is not the case. Say you have a reference to a node and want to run some XPath on it: you are forced to map the default namespace to a prefix of your choosing and use that prefix in the expression.



On Bogle's Blog I found an approach that registers a prefix for the default namespace. While this is nice, you still can't register the prefix once for the whole document; you must do it on each node you will be running an XPath expression on.

(On another note, did I just remove all carriage returns from my table to make blogger happy? Why yes, yes I did.)