Anemone is a web spider framework
that can spider a domain and collect useful information about the pages it
visits. It is versatile, allowing you to write your own specialized spider
tasks quickly and easily.
See anemone.rubyforge.org for
more information.
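
A minimal usage sketch (the URL and the printed output are illustrative
assumptions, not taken from this document):

  require 'anemone'

  # Crawl a site and print the URL of every page visited.
  Anemone.crawl("http://www.example.com/") do |anemone|
    anemone.on_every_page do |page|
      puts page.url
    end
  end
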
Features
- Multi-threaded design for high performance
- Tracks 301 HTTP redirects to understand a page's aliases
- Built-in BFS algorithm for determining page depth
- Allows exclusion of URLs based on regular expressions
- Choose the links to follow on each page with focus_crawl()
- HTTPS support
- Records response time for each page
- CLI program can list all pages in a domain, calculate page depths, and more
- Obeys robots.txt
- In-memory or persistent storage of pages during crawl, using TokyoCabinet
  or PStore (several of these options are sketched below)
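
Several of these features map to crawl options and block methods. The
following is a hedged sketch of one possible combination; the URL, regular
expressions, and option values are assumptions chosen for illustration:

  require 'anemone'

  Anemone.crawl("https://www.example.com/",
                :obey_robots_txt => true,                                    # honor robots.txt
                :depth_limit     => 3,                                       # stop past this BFS depth
                :storage         => Anemone::Storage.PStore('crawl.pstore')) do |anemone|

    # Exclude URLs based on regular expressions
    anemone.skip_links_like(/\.pdf$/i, /\/private\//)

    # Choose the links to follow on each page
    anemone.focus_crawl do |page|
      page.links.reject { |uri| uri.path.start_with?('/archive/') }
    end

    # Each Page records its BFS depth and response time
    anemone.on_every_page do |page|
      puts "#{page.url} depth=#{page.depth} response_time=#{page.response_time}"
    end
  end

Swapping the :storage option for Anemone::Storage.TokyoCabinet('crawl.tch'),
or omitting it entirely for in-memory storage, follows the same pattern.
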
Examples
See the scripts under the lib/anemone/cli directory for examples
of several useful Anemone tasks.
Requirements