ruby / jruby xml stream parser

June 20, 2009

to give sonar an xml input format, for content provided via a rest api or files, i looked around for a ruby xml parser oriented towards parsing large documents, perhaps too large to fit in memory

there are plenty of low-level stream parsers around, e.g. in rexml, but they stop some way short of letting you express the parse in a natural way
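for comparison, here is a minimal sketch of driving rexml's pull parser directly. the document and the variable names are made up for illustration; the point is that the nesting of the document has to be tracked by hand, in state kept outside the parser

```ruby
require 'rexml/parsers/pullparser'

# illustrative document, not taken from any real feed
doc = <<XML
<people>
  <person name="alice">likes cheese</person>
  <person name="bob">likes music</person>
</people>
XML

people = {}
current = nil # name attribute of the <person> we are inside, if any

parser = REXML::Parsers::PullParser.new(doc)
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == "person"
    # for a start element, event[0] is the tag name and event[1] the attributes
    current = event[1]["name"]
    people[current] = ""
  elsif event.text? && current
    # for a text event, event[0] is the raw text [ entities not expanded ]
    people[current] += event[0]
  elsif event.end_element? && event[0] == "person"
    current = nil
  end
end
```

this works in constant memory too, but every extra level of nesting means another piece of hand-maintained state like current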

here's a parser, which sits atop rexml's pull parser, and allows you to formulate your parse in ruby blocks which straightforwardly process xml elements. what you keep and how you convert is completely specified by those blocks, so you can happily parse an unending document in constant memory

<people>
  <person name="alice">likes cheese</person>
  <person name="bob">likes music</person>
  <person name="charles">likes alice</person>
</people>

with the xml above in doc, it can be parsed with:

require 'rubygems'
require 'xml_stream_parser'

people = {}
XmlStreamParser.new.parse_dsl(doc) do
  element "people" do |name, attrs|
    elements "person" do |name, attrs|
      people[attrs["name"]] = text
    end
  end
end

a plainer api is also supported, allowing a parse to be split over multiple methods [ parse_dsl uses instance_exec to call its blocks, so self inside them is no longer the calling object, and its methods and instance variables are out of reach ]

people = {}
XmlStreamParser.new.parse(doc) do |p|
  p.element( "people" ) do |name,attrs|
    p.elements( "person" ) do |name, attrs|
      people[attrs["name"]] = p.text
    end
  end
end
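the context loss is plain ruby behaviour rather than anything specific to this parser. a minimal sketch, in which TinyDsl and Importer are hypothetical stand-ins and not part of xml_stream_parser:

```ruby
# hypothetical stand-in for a parser offering both styles of api
class TinyDsl
  def text
    "text from the dsl object"
  end

  # dsl style: the block runs with self set to the TinyDsl instance
  def run_dsl(&block)
    instance_exec(&block)
  end

  # plain style: the block keeps the caller's self and is handed the object
  def run
    yield self
  end
end

# hypothetical caller that wants to use its own helper method in the block
class Importer
  def initialize
    @seen = []
  end

  def record(value)
    @seen << value
    value
  end

  def with_dsl
    # text resolves on TinyDsl, but so does record, which TinyDsl lacks
    TinyDsl.new.run_dsl { record(text) }
  rescue NoMethodError => e
    "failed: #{e.class}"
  end

  def with_plain
    # self is still the Importer, so record is found
    TinyDsl.new.run { |p| record(p.text) }
  end
end

importer = Importer.new
dsl_result = importer.with_dsl
plain_result = importer.with_plain
```

local variables are unaffected, since blocks are closures either way, which is why the people hash is reachable in both examples above. only self, and with it the caller's methods and instance variables, changes under instance_exec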