Chicago Data Rescue

Guest Post by: Karl-Rainer Blumenthal (landscapelibrarian.com)

If you’re wondering whether there’s a team, an event, or a dataset too small to make a difference to the Data Refuge effort, I’m here to say: nope! Now more than ever, with the winds of resistance at our backs, everyone can make a vital contribution to sustaining the climate and environmental sciences.

Take DataRescueChicago as an example. Following the events in Toronto and Philadelphia, a small group of attendees of the weekly civic hacking event Chi Hack Night organized around the need and the opportunity to contribute. While we had precious few hours to plan and act before the inauguration, we were ultimately able to feed hundreds of URLs into the End of Term web archiving project’s pipeline, back up a few technically complex resources on our own, and test the tools and workflows that future events could use to the same benefit.

We were a well-oiled, if hastily assembled, machine. We had four (count ‘em: four) volunteers assembly-lining the “seeding” and collecting that night. Given that size and the little time we had to work with, though, I think we did some real good! Here’s a quick summary of how we worked that should be exportable and scalable to any event:

Seeding

Objective #1 of our short time together was to add to the list of online data sources to be archived by the End of Term project. Our colleagues in Toronto had done the pioneering work here, with special regard to identifying and “nominating” the relevant URLs from the Environmental Protection Agency (EPA), and in Philadelphia the National Oceanic and Atmospheric Administration (NOAA). We took a different, less agency-specific approach in the hopes that we could keep anything from falling through the cracks among them. One volunteer who was working concurrently with the complementary Climate Mirror project brought with him URLs identified as especially important to the work of the Natural Resources Defense Council (NRDC), including data sets, reports, and tools from the above agencies as well as the Agriculture, Interior, and Energy Departments, among others. These URLs formed the foundation of our “seed list” for the End of Term project, and launched us into our other activities.
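For any event that wants to repeat this step, a first pass can be as simple as merging everyone’s contributed lists into one de-duplicated set of URLs. Here is a minimal sketch in Python, assuming plain-text lists with one URL per line; the file names are placeholders of my own, not our actual files:

```python
# Merge contributed URL lists into a single, de-duplicated seed list.
# The input and output file names below are hypothetical placeholders.
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase the scheme and host and drop fragments so duplicates collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))

seeds = set()
for source in ("nrdc_urls.txt", "chi_hack_night_nominations.txt"):
    with open(source) as f:
        seeds.update(normalize(line) for line in f if line.strip())

with open("seed_list.txt", "w") as out:
    out.write("\n".join(sorted(seeds)) + "\n")
```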

Scraping

To further suture inter-agency resources into the forthcoming archive, and to identify resources in need of special levels of attention beyond web crawling (more on that in a minute), we scraped some of our “starter pack” of URLs, and those that we could glean from open access journal articles and National Science Foundation (NSF) grants related to climate research, for further links. Initially with the help of some homegrown scripts, then with the Python-based web scraping tool Scrapy, one of our volunteers scanned through the original seed list and identified links out to other .gov sites and pages. Ultimately, we produced a list of ~400 URLs to archive, which I delivered the following morning (phew!) to the End of Term team for introduction into their pipeline.
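For future events, here is a minimal sketch of what that kind of Scrapy spider can look like. It is not our volunteer’s actual script, and the seed list file name is a placeholder; it simply visits each seed URL and reports every outgoing link to a .gov host:

```python
# gov_link_spider.py -- a sketch of a Scrapy spider that visits each seed URL
# and yields any outgoing links to .gov hosts for later nomination.
import scrapy
from urllib.parse import urlparse


class GovLinkSpider(scrapy.Spider):
    name = "gov_links"

    def start_requests(self):
        # "seed_list.txt" is a hypothetical file with one URL per line.
        with open("seed_list.txt") as f:
            for url in (line.strip() for line in f):
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            link = response.urljoin(href)
            if urlparse(link).netloc.endswith(".gov"):
                yield {"source": response.url, "target": link}
```

Running it with something like `scrapy runspider gov_link_spider.py -o gov_links.csv` produces a spreadsheet of candidate URLs that humans can then review and nominate.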

Recording

With the seed list assembled, our Climate Mirror colleague went about his work of crawling and storing large data resources, like agency FTP sites, with the open source web crawler Heritrix, one of the same tools used by the Internet Archive and its End of Term partners. In the meantime, two of us quickly looked through the original and scraped URLs to identify materials that, for their dynamic, user-responsive qualities, might be challenging or even impossible for him or others to crawl with such an automated tool. Videos, executable files, interactive maps and web apps, and more immediately earned “Needs attention” tags.
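Much of that triage was simply human eyes on URLs, but a rough heuristic pass can speed it up. The sketch below is an assumption of mine rather than our exact checklist: it flags URLs whose file extensions or path keywords suggest content that a crawler may not capture well, reading from the hypothetical gov_links.csv produced above:

```python
# Pre-tag URLs that an automated crawler may struggle with ("Needs attention").
# The extension and keyword lists are illustrative, not exhaustive.
from urllib.parse import urlparse

HARD_TO_CRAWL_EXTENSIONS = (".exe", ".jar", ".mp4", ".wmv", ".kmz")
HARD_TO_CRAWL_HINTS = ("map", "viewer", "interactive", "dashboard", "app")

def needs_attention(url):
    path = urlparse(url).path.lower()
    return path.endswith(HARD_TO_CRAWL_EXTENSIONS) or any(
        hint in path for hint in HARD_TO_CRAWL_HINTS
    )

flagged = []
with open("gov_links.csv") as f:   # hypothetical output of the scraping step
    next(f)                        # skip the header row
    for line in f:
        url = line.rsplit(",", 1)[-1].strip()
        if needs_attention(url):
            flagged.append(url)

print(len(flagged), "URLs tagged 'Needs attention'")
```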

One kind of special attention that they got: live web recording in place of crawling. Using the free and open source Webrecorder tool, which records what a human user sees as they browse, we created about a dozen WARC (Web ARChive) files for web pages and their associated data. We especially liked this workflow because it ensured that end-user patrons of the archived data would not only see datasets in .csv, .zip, and various other forms, but also retrieve them within the same surrounding context of the original web page from which they were served.

Here’s a very simple example: take these “Energy Data for Decision Makers,” published by the National Renewable Energy Laboratory (NREL).

Even after cutting my WiFi cord, I can open the WARC file (with Web Archive Player -- like running your own Wayback Machine!) and access the referenced PDF files in all of their original web context. The same is possible with all sorts of downloadable, executable, and interactive data served through a web browser.
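Because Webrecorder writes standard WARC files, you can also spot-check a recording offline with the warcio Python library before uploading it anywhere. This is an extra step of my own, not part of the Webrecorder or Web Archive Player workflow, and the file name below is a placeholder:

```python
# List the captured responses inside a Webrecorder WARC using warcio.
from warcio.archiveiterator import ArchiveIterator

with open("nrel-energy-data.warc.gz", "rb") as stream:   # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ctype = record.http_headers.get_header("Content-Type")
            print(ctype, uri)
```

Seeing the expected PDFs, CSVs, and HTML pages in that listing is a quick confirmation that the recording really captured the data, and not just the page around it.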

Storing

Preserving that kind of original context, and the chain of custody from live web to archive, is critical to making saved data reliably citable for scientists and researchers. The volunteers at Penn and Harvard have done great work to systematize the steps and tools necessary to “bag” all such provenance information right alongside the original data. In order to get started before the inauguration, though, our effort in Chicago had to get items into some kind of protective storage before that process could be completely formalized.

So we did what anyone and everyone can do: immediately upload our data to the redundant storage servers of the Internet Archive. We started by uploading our WARCs one at a time with the Internet Archive’s browser-based upload tool. Once we started backfilling our repository with the larger WARC files from the Climate Mirror crawls (an ongoing process -- these are some ~big~ data!), we found it even more expedient to use the Internet Archive’s Python package and/or command line tool to upload in bulk.
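For anyone who wants to try the same bulk route, here is a minimal sketch using the internetarchive Python package. It assumes you have already run `ia configure` to store your archive.org credentials, and the item identifier, file paths, and metadata are placeholders of mine, not our real items:

```python
# Bulk-upload WARC files to a single Internet Archive item.
# Identifier, file paths, and metadata below are hypothetical placeholders.
from glob import glob
from internetarchive import upload

warcs = sorted(glob("warcs/*.warc.gz"))
upload(
    "datarescuechicago-example-item",
    files=warcs,
    metadata={
        "title": "DataRescueChicago WARC backups",
        "mediatype": "web",
        "subject": "climate data; web archiving",
    },
)
```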

While we store these data in the Internet Archive until they can be added to Data Refuge’s own server space, the process could very well work in both directions. The Python and command line tools above can be used to quickly and easily add a layer of redundancy to Data Refuge’s data, making preservation copies in the Internet Archive’s data centers in the US and (eventually) Canada.
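Going the other direction is just as simple: the same package can pull an item back down wholesale to make a local or Data Refuge-side copy. Again, the identifier and destination directory here are placeholders:

```python
# Mirror an Internet Archive item locally for an extra layer of redundancy.
from internetarchive import download

download(
    "datarescuechicago-example-item",   # hypothetical item identifier
    destdir="data_refuge_mirror",
    verbose=True,
)
```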

...and Repeating

While I regret that we couldn't marshal the resources in time to match the amazing efforts that preceded us, I can't overstate the spirit of the mighty little team that came together here in Chicago. With still more organizers, events, and data out there, I hope that the above can help to jumpstart the next effort; to pass the baton to the next teammate in this continental relay race. If you use and improve the tools and workflows above, let us know!