Abstracting input sources
One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.
Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close it when they're done. But they don't. Instead, they take a file-like object.
In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.
This is how reading from real files works; the difference is that we're not limiting ourselves to real files. The input source could be anything: a file on disk, a web page, even a hard-coded string. As long as we pass a file-like object to the function, and the function simply calls the object's read method, the function can handle any kind of input source without specific code to handle each kind.
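As a minimal sketch of that idea (written for Python 3, where the in-memory string file lives in io.StringIO), the following defines a hypothetical StringSource class that is not a file but has a read method with an optional size parameter, plus a hypothetical count_characters function that relies on nothing but read. Both names are illustrative assumptions, not part of any library.

```python
import io


class StringSource:
    """Hypothetical file-like object: not a real file, but it has read()."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self, size=-1):
        # With no size (or a negative one), return everything that is left.
        if size is None or size < 0:
            end = len(self._text)
        else:
            end = min(self._pos + size, len(self._text))
        chunk = self._text[self._pos:end]
        # Remember the position so the next call picks up where we left off.
        self._pos = end
        return chunk


def count_characters(source, chunk_size=8):
    """Consume any file-like object in chunks, using only its read method."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty string means the source is exhausted
            break
        total += len(chunk)
    return total


# The same function handles our own class and the standard library's StringIO.
print(count_characters(StringSource("hello world")))
print(count_characters(io.StringIO("hello world")))
```

count_characters never checks what kind of object it was given; any object whose read method behaves this way will do.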
In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.
Example 5.25. Parsing XML from a file
>>> from xml.dom import minidom
>>> fsock = open('binary.xml')      # (1)
>>> xmldoc = minidom.parse(fsock)   # (2)
>>> fsock.close()                   # (3)
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

(1) First, we open the file on disk. This gives us a file object.
(2) We pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk.
(3) Be sure to call the close method of the file object when you're done with it. minidom.parse will not do this for you.
Well, that all seems like a colossal waste of time. After all, we've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.
Example 5.26. Parsing XML from a URL
>>> import urllib
>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf')   # (1)
>>> xmldoc = minidom.parse(usock)                                # (2)
>>> usock.close()                                                # (3)
>>> print xmldoc.toxml()                                         # (4)
<?xml version="1.0" ?>
<rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
</channel>
<image>
<title>Slashdot</title>
<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
<link>http://slashdot.org/</link>
</image>
<item>
<title>To HDTV or Not to HDTV?</title>
<link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link>
</item>
[...snip...]

(1) As we saw in the previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the source of the page at that URL.
(2) Now we pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
(3) As soon as you're done with it, be sure to close the file-like object that urlopen gives you.
(4) By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site.
Example 5.27. Parsing XML from a string (the easy but inflexible way)
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> xmldoc = minidom.parseString(contents)
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
OK, so we can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, we use... a different function. That means that if we want to be able to take input from a file, a URL, or a string, we'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying.
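To make the complaint concrete, here is a rough sketch of what that special logic might look like, written for Python 3. The function name parse_file_or_string is hypothetical, and the sketch makes a simplifying assumption: anything with a read method is treated as a file-like object, and anything else is treated as XML text (so it does not handle filenames at all).

```python
import io
from xml.dom import minidom


def parse_file_or_string(source):
    """Hypothetical dispatcher: choose the parsing function by input type."""
    if hasattr(source, "read"):
        # File-like object (open file, URL handle, ...): use minidom.parse.
        return minidom.parse(source)
    # Otherwise assume the source is the XML document itself, as a string.
    return minidom.parseString(source)


contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
from_string = parse_file_or_string(contents)
from_file_like = parse_file_or_string(io.BytesIO(contents.encode("utf-8")))
print(from_string.documentElement.tagName)
print(from_file_like.documentElement.tagName)
```

Every caller that wants to accept both kinds of input needs a branch like this one.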
If there were a way to turn a string into a file-like object, then we could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.
Example 5.28. Introducing StringIO
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> import StringIO
>>> ssock = StringIO.StringIO(contents)
>>> ssock.read()
"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock.read()
''
>>> ssock.seek(0)
>>> ssock.read(15)
'<grammar><ref i'
>>> ssock.read(15)
"d='bit'><p>0</p"
>>> ssock.read()
'><p>1</p></ref></grammar>'
>>> ssock.close()
Example 5.29. Parsing XML from a string (the file-like object way)
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock = StringIO.StringIO(contents)
>>> xmldoc = minidom.parse(ssock)
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
So now we know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, we use urlopen to get a file-like object; for a local file, we use open; and for a string, we use StringIO. Now let's take it one step further and generalize these differences as well.
def openAnything(source):
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except (IOError, OSError):
        pass

    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except (IOError, OSError):
        pass

    # assume source is string
    import StringIO
    return StringIO.StringIO(source)
Now we can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it.
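One way such a function might look is sketched below. This is an assumption-laden adaptation for Python 3, not the book's own code: urllib.urlopen becomes urllib.request.urlopen, StringIO.StringIO becomes io.BytesIO over the encoded string, and the names open_anything and parse_anything are hypothetical. The sketch also assumes the string form of the document is UTF-8 text.

```python
import io
import urllib.request
from xml.dom import minidom


def open_anything(source):
    """Return a binary file-like object for a URL, a pathname, or XML text."""
    # Try to open as a URL (http, ftp, or file URL).
    try:
        return urllib.request.urlopen(source)
    except (ValueError, OSError):
        # ValueError: the string is not a URL; OSError covers URLError too.
        pass

    # Try to open as a pathname on the local disk.
    try:
        return open(source, "rb")
    except (ValueError, OSError):
        pass

    # Assume the source is the document itself, held in a string.
    return io.BytesIO(source.encode("utf-8"))


def parse_anything(source):
    """Parse XML from a URL, a local filename, or a hard-coded string."""
    sock = open_anything(source)
    try:
        return minidom.parse(sock)
    finally:
        # minidom.parse does not close the object for us.
        sock.close()


xmldoc = parse_anything("<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
print(xmldoc.documentElement.tagName)
```

The caller passes whatever it has, and the three-way guessing stays in one place. The guessing is still only a heuristic: a string that happens to name an existing file will be opened as a file rather than parsed as XML.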