Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
mechanize.Browser is a subclass of mechanize.UserAgent, which is, in turn, a subclass of ClientCookie.OpenerDirector (like urllib2.OpenerDirector, so any URL can be opened, not just http:). mechanize.UserAgent offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener() (this interface is not stable yet, though).
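For instance, robots.txt and redirection handling can be toggled on an existing Browser without constructing a new opener (a minimal sketch; the set_handle_* method names follow mechanize's docstrings and may differ in your installed version):

import mechanize

b = mechanize.Browser()
# Toggle user-agent features on the existing object, instead of
# building a fresh OpenerDirector with build_opener() each time.
b.set_handle_robots(False)   # skip fetching and obeying robots.txt
b.set_handle_redirect(True)  # follow HTTP 30x redirections
b.addheaders = [("User-agent", "my-script/0.1")]  # OpenerDirector attribute
b.open("http://www.example.com/")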
Features include:

- Browser history (.back() and .reload() methods).
- The Referer HTTP header is added properly (optional).
- Automatic observance of robots.txt.
The lower-level OpenerDirector interface remains available for cases where the mechanize.Browser / ClientForm API is not sufficient, as sketched below.
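For example, a hand-built Request object can be passed straight to .open() (a hedged sketch: ClientCookie.Request mirrors urllib2.Request, but check the docstrings of your installed versions):

import ClientCookie
import mechanize

b = mechanize.Browser()
# Build the request at the urllib2 / ClientCookie level, then open it
# through the Browser, which is itself an OpenerDirector subclass.
req = ClientCookie.Request("http://www.example.com/")
req.add_header("Accept-Language", "en-GB, en")
response = b.open(req)
print response.geturl()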
A fuller example:

import re
from mechanize import Browser

b = Browser()
b.open("http://www.example.com/")
# follow second link with element text matching regular expression
response = b.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1)
assert b.viewing_html()
print b.title()
print response.geturl()
print response.info()  # headers
print response.read()  # body
response.close()

b.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
b["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response2 = b.submit()  # submit current form

response3 = b.back()  # back to cheese shop
# the history mechanism uses cached requests and responses
assert response3 is response
# we can still use the response, even though we closed it:
response3.seek(0)
response3.read()

response4 = b.reload()
assert response4 is not response3

for form in b.forms():
    print form

# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in b.links(url_regex=re.compile("python.org")):
    print link
    b.follow_link(link)  # takes EITHER Link instance OR keyword args
    b.back()
Full documentation is in the docstrings.
Thanks to Ian Bicking, for persuading me that a UserAgent class would be useful.
To do:

- A .response() method (each call should return an independent pointer to the same data; see the sketch after this list).
- Work with either urllib2 or ClientCookie (currently depends on the latter: just a matter of deciding on a way to specify this).
- Stabilise mechanize.UserAgent.
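The intended semantics of that .response() method might look like the following (purely hypothetical: the method does not exist in this release; each call would return a fresh handle onto the same cached body):

import mechanize

b = mechanize.Browser()
b.open("http://www.example.com/")
# Hypothetical sketch only -- .response() is still on the to-do list.
r1 = b.response()
r2 = b.response()
data = r1.read()          # exhausting one handle ...
assert r2.read() == data  # ... must leave the other's file position alone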
All documentation (including this web page) is included in the distribution.
This is an alpha development release: interfaces may change, and there will be bugs.
For installation instructions, see the INSTALL file included in the distribution.
Richard Jones' webunit (this is not the same as Steven Purcell's code of the same name). webunit and mechanize are quite similar. On the minus side, webunit is missing things like browser history, high-level forms and links handling, thorough cookie handling, refresh redirection, adding of the Referer header, observance of robots.txt and easy extensibility. On the plus side, webunit has a bunch of utility functions bound up in its WebFetcher class, which look useful for writing tests (though they'd be easy to duplicate using mechanize). In general, webunit has more of a frameworky emphasis, with aims limited to writing tests, where mechanize and the modules it depends on try hard to be general-purpose libraries.
There are many related links in the General FAQ page, too.
Requires Python 2.2 or above, plus ClientCookie 0.4.19 or newer (note the required version!), ClientForm 0.1.x, and pullparser 0.0.4b or newer.
The BSD license (included in distribution).
John J. Lee, January 2005.