ruby / jruby xml stream parser
June 20, 2009
to give sonar an xml input format, for content provided via a rest api or files, i looked around for a ruby xml parser oriented towards parsing large documents : perhaps too large to fit in memory
there are plenty of low-level stream parsers around e.g. in rexml, but they stop some way short of allowing the solution to be expressed in a natural way
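to make "low-level" concrete, here is a rough sketch of pulling name/text pairs out of a small document with rexml's pull parser directly. the document string and the `current` bookkeeping variable are made up for illustration; the point is only that you end up tracking parse state by hand, event by event

```ruby
require 'rexml/parsers/pullparser'

# illustrative input, not taken from any real feed
xml = '<people><person name="alice">likes cheese</person>' \
      '<person name="bob">likes music</person></people>'

people = {}
current = nil # name attribute of the <person> we are inside, if any

parser = REXML::Parsers::PullParser.new(xml)
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == "person"
    # for a start_element event, [0] is the tag name and [1] the attribute hash
    current = event[1]["name"]
    people[current] = ""
  elsif event.text? && current
    # for a text event, [0] is the text content
    people[current] += event[0]
  elsif event.end_element? && event[0] == "person"
    current = nil
  end
end

p people
```

even for a two-level document the state tracking dominates the code, and it gets worse as nesting grows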
here's a parser, which sits atop rexml's pull parser, and allows you to formulate your parse in ruby blocks which straightforwardly process xml elements. what you keep and how you convert is completely specified by those blocks, so you can happily parse an unending document in constant memory
    <people>
      <person name="alice">likes cheese</person>
      <person name="bob">likes music</person>
      <person name="charles">likes alice</person>
    </people>
can be parsed with :
    require 'rubygems'
    require 'xml_stream_parser'

    people = {}
    XmlStreamParser.new.parse_dsl(doc) do
      element "people" do |name, attrs|
        elements "person" do |name, attrs|
          people[attrs["name"]] = text
        end
      end
    end
a plainer api is also supported, allowing a parse to be split over multiple methods [ parse_dsl can't do this, since it uses instance_exec to call blocks, and so loses the caller's context ]
    people = {}
    XmlStreamParser.new.parse(doc) do |p|
      p.element("people") do |name, attrs|
        p.elements("person") do |name, attrs|
          people[attrs["name"]] = p.text
        end
      end
    end
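the context loss is a general property of instance_exec rather than anything specific to this parser. a minimal sketch, using two made-up classes (TinyDsl and Importer are hypothetical stand-ins, not part of xml_stream_parser): inside an instance_exec'd block self is the dsl object, so the caller's own methods are out of reach, whereas a yielded object leaves self alone

```ruby
# hypothetical stand-in for a parser offering both calling styles
class TinyDsl
  def text
    "likes cheese"
  end

  # dsl style: the block runs with self rebound to this TinyDsl
  def run_dsl(&block)
    instance_exec(&block)
  end

  # plain style: the block keeps its own self and receives the dsl object
  def run_plain
    yield self
  end
end

# hypothetical caller that wants to use its own helper method in the block
class Importer
  attr_reader :seen

  def initialize
    @seen = []
  end

  def record(value)
    @seen << value
  end

  def import_dsl
    # self is the TinyDsl here, which has no record method
    TinyDsl.new.run_dsl { record(text) }
  end

  def import_plain
    # self is still the Importer, so record is reachable
    TinyDsl.new.run_plain { |p| record(p.text) }
  end
end

importer = Importer.new
importer.import_plain
p importer.seen

begin
  importer.import_dsl
rescue NoMethodError => e
  puts "dsl style failed: #{e.class}"
end
```

local variables such as the people hash above are still visible in both styles, since blocks are closures; it is only self, and with it instance methods and instance variables, that changes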