A ‘Web Scraper’ is a program that attempts to make a webpage or website accessible, in full, to other programs. For instance, governments often spend energy publishing their databases as ‘data portals’ - showing one page of police or election data at a time. This is perfectly fine for the majority of people visiting the site, but if you’re trying to do analysis like aggregation or visualization, or if you’re trying to download your own data in bulk, it’s counterproductive and frustrating.
And so web scrapers do the task of a very industrious user - they navigate websites, parse pages, and save information in bulk.
In the course of being a so-called data-centered programmer and a somewhat crazy person, I’ve written a lot of these scripts. They scrape things like Yelp, GitHub, Twitter, Garmin Connect, and weirder stuff like Afghanistan Election sites, public imagery sources, and more.
There are a few things that have stood out in all of this experience.
There are two things about browsers which are valuable and hard to replicate: strong, tolerant HTML parsing, and the sessions/cookies/browserstring identification paradigm.
The web is filled with invalid HTML, which is in turn invalid XML. Browsers tolerate it, and users write it.
I support Marc completely in his decision to make Mosaic work as best it can when it is given invalid HTML. The maxim is that one should be
- conservative in what one does
- liberal in what one expects.
So, advanced scrapers try to be browsers. The historical example is Mechanize, in Ruby and Perl, which implements such features as a ‘cookie jar’ to act like a browser. The new kid on the block is PhantomJS, a headless WebKit that’s inches away from being a node.js extension.
Scrapers are weird but not narrow: they’re important for journalists, and they’re more and more a feature of ‘exporting’ data from morally wobbly sites.
What we think of as scrapers - Python or Ruby scripts - aren’t usable for most users, who will minimally be wary of the Terminal, and mostly be using Windows.
My main complaint with systems like ScraperWiki is that they take data from one difficult system and put it into a new one, with its own API. The data that you get out of scrapers should be simple - that means either CSV or JSON, with the emphasis on JSON because it is a real standard with plenty of high-quality parsers.
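To make the CSV-versus-JSON point concrete, here is a minimal sketch of writing the same scraped rows both ways. The records and field names are made up for illustration, and the CSV quoting follows the common RFC 4180 convention rather than anything a particular tool requires.

```javascript
// Hypothetical scraped records; the field names are illustrative only.
const records = [
  { name: 'Cafe A', rating: 4.5 },
  { name: 'Diner "B", Inc.', rating: 3 }
];

// JSON: one standard call, and nested values would survive intact.
function toJson(rows) {
  return JSON.stringify(rows, null, 2);
}

// CSV: quote every field and double any embedded quotes (RFC 4180 style).
// Columns are taken from the first row, so rows are assumed to be uniform.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const columns = Object.keys(rows[0]);
  const quote = (value) =>
    '"' + (value == null ? '' : String(value)).replace(/"/g, '""') + '"';
  const lines = [columns.map(quote).join(',')];
  for (const row of rows) {
    lines.push(columns.map((column) => quote(row[column])).join(','));
  }
  return lines.join('\r\n');
}
```

The JSON side is a single built-in call, while the CSV side already needs decisions about quoting and column order, which is part of why JSON gets the emphasis.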
My choice of language and environment for web scraping has always been driven by the package - by what XML parser and HTTP request toolchain was available. Sometimes it’s Python, sometimes it’s Ruby, and recently it has been node.js with the cheerio module.
It’s not cute. Firefox is lagging behind Chrome most of the time nowadays, but Greasemonkey is the best augmented browsing software around, and has a great community for that code.
And so the approach is as such:
That’s it. To review the finer points:
happen is a library I wrote for creating real browser events. You can trigger events with jQuery, but mind that it’s not real - that code simply executes handlers. If you haven’t used jQuery to bind those events, they can’t be triggered with jQuery’s trigger(). happen will trigger all events always.
A relevant example of happen in a scraper:
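As a rough sketch of the idea, the snippet below dispatches a real DOM event with the standard dispatchEvent API. It does not use happen's own interface, and the helper names and the '.next-page' selector are hypothetical, picked only for illustration.

```javascript
// Dispatch a genuine DOM event so every listener on the target fires,
// no matter which library (if any) was used to bind it.
// `target` is any EventTarget; in a page that is usually an element.
function fireEvent(target, type) {
  // Prefer MouseEvent where it exists (browsers); fall back to plain Event.
  const EventCtor = typeof MouseEvent === 'function' ? MouseEvent : Event;
  const event = new EventCtor(type, { bubbles: true, cancelable: true });
  // dispatchEvent returns false if a listener called preventDefault().
  return target.dispatchEvent(event);
}

// Hypothetical scraper step: move to the next page of results by "clicking"
// a pagination link. `doc` is the page's document object.
function clickNext(doc) {
  const link = doc.querySelector('.next-page');
  if (!link) return false; // no more pages
  fireEvent(link, 'click');
  return true;
}
```

In a user script, a step like clickNext(document) could run after each page's data has been saved.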
localStorage is a slow but steady storage API. You can save data in your browser itself with it, with a super-simple interface. For instance,
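a minimal sketch of appending scraped rows under one key might look like this. The key name 'scrape' and the helper names are assumptions made for the example; getItem and setItem are the standard Web Storage calls.

```javascript
// The key name is an arbitrary choice for this sketch.
const STORAGE_KEY = 'scrape';

// `storage` defaults to the browser's localStorage; passing it in makes the
// helpers easy to try against any object with getItem/setItem.
function loadRows(storage = globalThis.localStorage) {
  // getItem returns null for a missing key, so fall back to an empty list.
  return JSON.parse(storage.getItem(STORAGE_KEY) || '[]');
}

function saveRow(row, storage = globalThis.localStorage) {
  const rows = loadRows(storage);
  rows.push(row);
  // localStorage only holds strings, so the whole list is re-serialized.
  storage.setItem(STORAGE_KEY, JSON.stringify(rows));
  return rows.length;
}
```

Re-serializing everything on each save is wasteful, but it keeps the data in one place and survives page navigations, which matters more for a scraper than speed.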
Then, when you’ve stored your data into one big object, just enter in your Firebug console:
Which gets the current data dump and downloads it to your system as JSON - easy as that.
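One way to do that kind of dump, as a sketch: encode the data as a data: URI and hand it to the browser as a download. The function names and the 'scrape.json' file name are hypothetical, and downloadJson only works inside a page, since it needs the document object.

```javascript
// Encode any JSON-serializable value as a data: URI.
function jsonDataUri(data) {
  return 'data:application/json;charset=utf-8,' +
    encodeURIComponent(JSON.stringify(data));
}

// Browser-only: trigger a download through a temporary link element.
// The default file name is an illustrative choice.
function downloadJson(data, filename = 'scrape.json') {
  const link = document.createElement('a');
  link.href = jsonDataUri(data);
  link.download = filename;
  document.body.appendChild(link);
  link.click();
  link.remove();
}
```

From the console, something like downloadJson(JSON.parse(localStorage.getItem('scrape'))) would save whatever was stored under a 'scrape' key, assuming that is the key the scraper wrote to.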
This technique has a few awesome benefits. You can prototype your scrapers in Firebug, getting really fast feedback on how to select, clean, and store stuff from the page. Other people who know jQuery can read the script and get what it does - it’s just jQuery.
It can handle crazy cases like sites that need login - which otherwise are a pain to script.
It’s not blazingly fast, but, of course, scraping is ‘fast’ when it’s ‘not a week of some intern’s time’.