REXML stands for "Ruby Electric XML". Sorry. I'm not very creative when it comes to working names for my software, and I invariably use the working names as the final product names. The "Ruby" comes from the Ruby language, obviously. The "Electric XML" comes from the inspiration for this project, the Electric XML Java processing library.
This software is distributed under the Ruby license.
1.2.5: Bug fixes: doctypes that had spaces between the closing ] and > generated errors. There was a small bug that caused too many newlines to be generated in some output. Eelis van der Weegen (what a great name!) pointed out one of the numerous API errors. Julian requested that add_attributes take both Hash (original) and array of arrays (as produced by StreamListener). I killed the mailing list, accidentally, and fixed it again. Fixed a bug in next_sibling, caused by a combination of mixing overriding <=>() and using Array.index().
1.2.4: Changes since 1.1b: 100% OASIS valid tests passed. UTF-8/16 support. Many bug fixes. to_a() added to Parent and Element.elements. Updated tutorial. Added variable IOSource buffer size, for stream parsing. delete() now fails silently rather than throwing an exception if it can't find the element to delete. Added a patch to support REXMLBuilder. Reorganized file layout in distribution; added a repackaging program; added the logo.
1.1b: Changes since 1.1a: Stream parsing added. Bug fixes in entity parsing. New XPath implementation, fixing many bugs and making it feature complete. Completed whitespace handling, adding much functionality and fixing several bugs. Added convenience methods for inserting elements. Improved error reporting. Fixed attribute content to correctly handle quotes and apostrophes. Added mechanisms for handling raw text. Cleaned up utility programs (profile.rb, comparison.rb, etc.). Improved speed a little. Brought REXML up to 98.9% OASIS valid source compliance.
Why REXML? There are, at the time of this writing, already two XML parsers for Ruby. The first is a Ruby binding to a native XML parser. This is a fast parser, using proven technology. However, it isn't very portable. The second is a native Ruby implementation, and as useful as it is, it has (IMO) a difficult API.
I have this problem: I dislike obfuscated APIs. There are several XML parser APIs for Java. Most of them follow DOM or SAX, and are very similar in philosophy to an increasing number of Java APIs. Namely, they look like they were designed by theorists who never had to use their own APIs. The extant XML APIs, in general, suck. They take a markup language which was specifically designed to be very simple, elegant, and powerful, and wrap an obnoxious, bloated, and large API around it. I was always having to refer to the API documentation to do even the most basic XML tree manipulations; nothing was intuitive, and almost every operation was complex.
Then along came Electric XML.
Ah, bliss. Look at the Electric XML API. First, the library is small; less than 500K. Next, the API is intuitive. You want to parse a document? doc = new Document( some_file ). Create and add a new element? element = parent.addElement( tag_name ). Write out a subtree? element.write( writer ). Now how about DOM? To parse some file: parser = new DOMParser(); parser.parse( new InputSource( new FileInputStream( some_file ) ) ). Create a new element? First you have to know the owning document of the to-be-created node (can anyone say "global variables, or obtuse, multi-argument methods"?) and call element = doc.createElement( tag_name ). Then you get to call parent.appendChild( element ). "appendChild"? Where did they get that from? How many different methods do we have in Java in how many different classes for adding children to parents? addElement()? add()? put()? appendChild()? Heaven forbid that you want to create an Element elsewhere in the code without having access to the owning document. I'm not even going to go into what travesty of code you have to go through to write out an XML sub-tree in DOM.
So, I use Electric XML extensively. It is small, fast, and intuitive. I.e., the API doesn't add a bunch of work to the task of writing software. When I started to write more software in Ruby, I needed an XML parser. I wasn't keen on the native library binding, "XMLParser", because I try to avoid complex library dependencies in my software, when I can. For a long time, I used NQXML, because it was the only other parser out there. However, the NQXML API can be even more painful than the Java DOM API. Almost all element operations require some indirect node access... you have to do something like element.node.attr['key'], and it is never obvious to me when you access the element directly, or the node... or, really, why they're two different objects, anyway. This is even more unfortunate since Ruby is so elegant and intuitive, and bad APIs really stand out. I'm not, by the way, trying to insult NQXML; I just don't like the API.
I wrote the people at TheMind (Electric XML... get it?) and asked them if I could do a translation to Ruby. They said yes. After a few weeks of hacking on it for a couple of hours each week, and after having gone down a few blind alleys in the translation, I had a working beta. I.e., it parsed, but hadn't gone through a lot of strenuous testing. Along the way, I had made a few changes to the API, and a lot of changes to the code. First off, Ruby does iterators differently than Java. Java uses a lot of helper classes. Helper classes are exactly the kinds of things that theorists come up with... they look good on paper, but using them is like chewing glass. You find that you spend 50% of your time writing helper classes just to support the other 50% of the code that actually does the job you were trying to do in the first place. In this case, the Java helper classes are either Enumerations or Iterators. Ruby, on the other hand, uses blocks, which is much more elegant. Rather than:
for (Enumeration e = parent.getChildren(); e.hasMoreElements(); ) {
  Element child = (Element) e.nextElement();
  // Do something with child
}
you get:
parent.each_child { |child|
  # Do something with child
}
Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.
Anyhoo, I chose to use blocks in REXML directly, since this is more common to Ruby code than for x in y ... end, which is as orthogonal to the original Java as possible.
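To make the block-versus-for distinction concrete, here is a minimal sketch. The Parent class below is a hypothetical stand-in, not REXML's actual class; it only assumes that a parent keeps its children in an array. Defining each (here as an alias of each_child) is what lets Ruby's for x in y ... end work on the same object.

```ruby
# Hypothetical stand-in for a parent node; not REXML's real implementation.
class Parent
  include Enumerable

  def initialize(*children)
    @children = children
  end

  # Yield every child to the caller's block.
  def each_child(&block)
    @children.each(&block)
  end

  # `for ... in` and the Enumerable methods both rely on #each.
  alias each each_child
end

parent = Parent.new('a', 'b', 'c')

# Block style, as used throughout REXML.
seen_with_block = []
parent.each_child { |child| seen_with_block << child }

# The for-loop style; it calls #each behind the scenes.
seen_with_for = []
for child in parent
  seen_with_for << child
end
```

Both loops visit the same children in the same order; the block form is simply the one most Ruby code uses.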
Also, I changed the naming conventions to more Ruby-esque method names. For example, the Java method getAttributeValue() becomes in Ruby get_attribute_value(). This is a toss-up. I actually like the Java naming convention more, but the latter is more common in Ruby code, and I'm trying to make things easy for Ruby programmers, not Java programmers.
The biggest change was in the code. The Java version of Electric XML did a lot of efficient String-array parsing, character by character. Ruby, however, has ubiquitous, efficient, and powerful regular expression support. All regex functions are done in native code, so it is very fast, and the power of Ruby regex rivals that of Perl. Therefore, a direct conversion of the Java code to Ruby would have been more difficult, and much slower, than using Ruby regexps. I therefore used regexps. In doing so, I cut the number of lines of source code by half [1].
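As a rough illustration of the regex-driven approach (this is not REXML's actual parser), the sketch below tokenizes a very small subset of XML with StringScanner from Ruby's standard library. The tokenize method and the token layout are invented for this example, and it assumes no comments, entities, CDATA sections, or ">" characters inside attribute values.

```ruby
require 'strscan'

# Hypothetical helper: split simplified XML into start-tag, end-tag,
# and text tokens using regular expressions instead of a character loop.
def tokenize(xml)
  scanner = StringScanner.new(xml)
  tokens = []
  until scanner.eos?
    if scanner.scan(%r{</([\w:.-]+)\s*>})
      tokens << [:end_tag, scanner[1]]
    elsif scanner.scan(%r{<([\w:.-]+)([^<>]*?)(/?)>})
      name = scanner[1]
      raw_attrs = scanner[2]
      self_closing = scanner[3] == '/'
      # Pull key="value" or key='value' pairs out of the tag body.
      attrs = raw_attrs.scan(/([\w:.-]+)\s*=\s*(["'])(.*?)\2/)
                       .map { |key, _quote, value| [key, value] }
                       .to_h
      tokens << [:start_tag, name, attrs]
      tokens << [:end_tag, name] if self_closing
    elsif scanner.scan(/[^<]+/)
      tokens << [:text, scanner[0]]
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end
  tokens
end

tokens = tokenize('<a x="1">hi<b/></a>')
```

Each branch is one regular expression doing work that would take a hand-written loop and several state variables in a character-by-character parser.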
Finally, by this point the API looks almost nothing like the original Electric XML API, and practically none of the code is even vaguely similar. However, even though the actual code is completely different, I did borrow the same process of processing XML as Electric, and am deeply indebted to the Electric XML code for inspiration.
Run 'ruby install.rb'. By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them.
Please see the Tutorial.
The API documentation is here. Some examples using REXML are included in the distribution archive, and the Tutorial provides examples with commentary.
Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at three things: speed, size, and API.
REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks, although it may not be clear from these which operations are slower. Most of the places where REXML is slower are because of the convenience methods [3]. On the positive side, most of the convenience methods can be bypassed if you know what you are doing. Check the benchmark comparison page for a general comparison. You can look at the benchmark code yourself to decide how much salt to take with them.
The sizes of the distributions are very close. NQXML has about 1400 non-blank, non-comment lines of code; REXML, 1823 [4].
The last thing is the API, and this is where I think REXML wins, hands down. The core API is clean and intuitive, and things work the way you would expect them to. Convenience methods abound, and you can code for either convenience or speed. REXML code is terse, and readable, like Ruby code should be. The best way to decide which you like more is to write a couple of small applications in each, then use the one you're more comfortable with.
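For a taste of what that looks like in practice, here is a small sketch. It assumes the REXML that ships with current Ruby distributions (loaded with require 'rexml/document'); method names in the 1.2.x release described on this page may differ. The sample document and variable names are made up for illustration.

```ruby
require 'rexml/document'

xml = '<inventory><item name="apple"/><item name="pear"/></inventory>'

# Parse a document.
doc = REXML::Document.new(xml)

# Create and add a new element, with an attribute, in one call.
plum = doc.root.add_element('item', { 'name' => 'plum' })

# Iterate over child elements with a block.
names = []
doc.root.each_element('item') { |item| names << item.attributes['name'] }

# Serialize a subtree back to a string.
plum_markup = plum.to_s
```

Parsing, adding, iterating, and writing are each a single call on the object you already have in hand, with no owning-document bookkeeping.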
It should be noted that NQXML does not support XPath searches.
Here is the status of the XPath implementation.
/               root
.               self
..              parent
*               all element children
//              all elements in document
//child         all "child" elements in document
parent//child   all "child" descendants of child element "parent"
parent/child    all "child" elements of "parent"
[...]           all predicates (attribute, index, text)
[...][...]      compound predicates
element         child element "element"
function()      (partially)
axe::           (partially)
Some of this API (the API dealing with function() handling, in particular) is subject to change.
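A few of the expressions from the status table can be tried out as follows. This sketch assumes the REXML::XPath module as found in the REXML bundled with current Ruby, which may not match the 1.2.x API exactly; the sample document is invented for the example.

```ruby
require 'rexml/document'

xml = <<~XML
  <library>
    <shelf id="1">
      <book lang="en">Ruby</book>
      <book lang="de">XML</book>
    </shelf>
    <shelf id="2">
      <book lang="en">XPath</book>
    </shelf>
  </library>
XML
doc = REXML::Document.new(xml)

# //child: all "book" elements anywhere in the document.
all_books = REXML::XPath.match(doc, '//book')

# [...] with an attribute predicate.
english = REXML::XPath.match(doc, "//book[@lang='en']")

# [...] with an index predicate, followed by parent/child.
second_shelf_books = REXML::XPath.match(doc, '/library/shelf[2]/book')

# [...] with a text predicate, then .. to step up to the parent.
shelf_of_xpath = REXML::XPath.first(doc, "//book[text()='XPath']/..")
```

XPath.match returns an array of matching nodes, while XPath.first returns only the first match (or nil when nothing matches).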
Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.
I've had help from a number of resources; if I haven't listed you here, it means that I just haven't gotten around to adding you, or that I'm a dork and have forgotten. In either case, feel free to write me and complain. I may ignore you, but at least you tried. (Actually, I don't consciously ignore anybody except spammers.)