Data Refuge Updates

Data Refuge and #DataRescuePhilly

What is Data Refuge?

In conversation with many partners, such as you, we can build refuge for federal climate and environmental data vulnerable under an administration that denies the fact of ongoing climate change. We are committed to fact-based arguments. Data Refuge works to preserve the facts we all need.

How are We Building Data Refuge?

Data Refuge makes and preserves research-quality copies of federal climate and environmental data to keep them easily accessible to the research community and the public. Data Refuge also promotes data literacy and teaches about the inherent instability of internet content. It fosters environmental and climate literacy as well, connecting data with the people and communities who produce and use data--or wish they had better data. Our work is based on a core best practice in data management: Lots of Copies Keep Stuff Safe (LOCKSS). This practice is reflected in our logo: the EPA logo padlocked.
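To make the "lots of copies" idea concrete, here is a minimal sketch of how several copies of one file could be checked against each other. It is an illustration only, not Data Refuge's actual tooling: the function names (sha256_of, audit_copies) and the majority-vote rule are assumptions made for this example.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(copies: list[Path]) -> dict[str, list[Path]]:
    """Compare several copies of one file (hypothetical helper).

    Copies whose checksum matches the most common checksum are "good";
    copies that differ, or that are missing, are "suspect".
    """
    digests = {p: sha256_of(p) for p in copies if p.is_file()}
    if not digests:
        return {"good": [], "suspect": list(copies)}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    good = [p for p in copies if digests.get(p) == majority]
    suspect = [p for p in copies if digests.get(p) != majority]
    return {"good": good, "suspect": suspect}
```

If only two copies exist and they disagree, a checksum comparison cannot say which one is correct. That is one reason the practice calls for lots of copies rather than a single backup.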

Who Manages Data Refuge?

The project was launched by the Penn Program in the Environmental Humanities in November 2016 and is co-organized by Laurie Allen (Penn Libraries) and Bethany Wiggin (Program in Environmental Humanities) with assistance from Margaret Janz (Penn Libraries) and Patricia Kim (Program in Environmental Humanities). Data Refuge is built by the Penn Libraries and the Program in Environmental Humanities--with a lot of help from, and wonderful collaborations with, partners at many other libraries and universities across North America.

How Much of Data Refuge Has Been Built?

In the weeks since late November 2016 when we launched Data Refuge, the project has worked successfully to:

  • Spawn six Data Rescues (Philadelphia, Chicago, Indianapolis, Los Angeles, Ann Arbor, New York) after the initial data rescue in Toronto in mid-December 2016. These rescues concentrate on 1) feeding the Internet Archive (IA), now also through a special account IA opened for Data Refuge; and 2) downloading and describing “uncrawlables,” data which cannot easily go into the Internet Archive and so go into

  • Draw public attention to the vulnerability of climate and environmental data in major national and international media outlets as well as in many smaller ones. A partial bibliography is available here. (On 1/19, Data Refuge appeared in a feature on HBO’s Vice News Tonight.)

  • Build and support research collaboratives, including the Environmental Data Governance Initiative (EDGI)

  • Feed (as of 1/18/2017) 7229 web sites into the Internet Archive

  • Capture (as of 1/18/2017) 1.5+ terabytes of uncrawlables; identify 1259 suspected uncrawlable data sets.

  • Design protocols to allow workflows to continue securely and with quality assurance even as collaborators are not in the same room or working at the same time

  • Participate in a project to track changes to federal environmental and climate websites, in partnership at Penn with the Price Lab for Digital Humanities

  • Prompt exploratory conversations among university research libraries and institutes to create a repository for copies of federal research data. Unlike the federal depository system, which archives materials as they are pushed to it, this proposed repository would actively pull data from federal sites

  • Incubate projects to identify use cases of open climate and environmental data--or of that data's absence--and to tell their users’ stories

  • Spark course development at the intersection of environmental humanities and digital humanities, including courses to support the new Minor and Graduate Certificate in digital humanities at Penn

What happened at Data Rescue Philly in Van Pelt Library 1/13-1/14?

With participation from collaborators at Penn, in Philadelphia, and from American and Canadian universities and research libraries, we hosted more than 250 people over the two days’ events in Van Pelt Library: a teach-in, a Guides training session, a roundtable on climate and data value and vulnerability, an art installation (up until 2/9/17 in the Annenberg Center for Performing Arts) about the variety of data we need to understand the past, present, and possible futures of the Schuylkill River, a day-long archive-a-thon, and two receptions for guests.


One month earlier, Environmental Humanities Fellows Kevin Burke and Patricia Kim attended the first Data Rescue event (12/17/2016), organized by Prof. Michelle Murphy at the University of Toronto. The Toronto rescue concentrated on identifying content on the EPA website for archiving in the Internet Archive (IA) by the End of Term (EoT) Harvest. The Philadelphia Data Rescue continued and expanded Toronto’s collaboration with EoT and the IA. We also developed know-how, protocols, tools, and knowledge communities to feed the IA--this time with a focus on the NOAA website--and opened for the datasets and other web content that is not machine crawlable. Before the event, we developed a public survey--still circulating widely, thanks to help from the Union of Concerned Scientists--to understand the value and vulnerability of various data sets. If you haven't taken it, please do.

The Data Rescue events also included paths to document and develop the refuge and to conceive stories and visual materials about it, its data, "data rescuers," and the people and communities who use the data or might wish they had good data. Several projects combining tech know-how with documentation and storytelling skills are now underway with time horizons of the very near future, the first 100 days of the new administration, and the following months and years.

More than 130 volunteers used our Refuge Field Guide to select one of six paths, each led by Guides. These paths were: Seeders, Baggers, Toolbuilders, Metadata, Storytellers & Documentation, and The Long Trail, which focused on thinking about the future of DataRefuge.

The “seeders,” led by an ace trio of Guides, got through 3,692 NOAA websites on Saturday. These Guides--Maya Anjur-Dietrich, Andrew Bergman, and Toly Rinberg (all Ph.D. Candidates in Applied Physics at Harvard University)--belong to the consortium of researchers in the Environmental Data Governance Initiative (Wiggin is a member of EDGI’s steering committee). The group of roughly 40 seeders used agency primers and subprimers written by the trio and a Chrome extension developed initially by EDGI member Matt Price (Toronto). The workflow for seeders is available via EDGI here. Penn Libraries' account with Internet Archive expedites seeding the Internet Archive.

DataRescuePhilly focused on uncrawlable data and taught participants how to begin to identify what was crawlable or not. As this tutorial poster clarifies, crawlables could go into the Internet Archive; uncrawlables into

The "baggers," led by Justin Schell (University of Michigan Libraries), captured a lot of NOAA data--in the words tweeted out by event participant and roundtable panelist Robert Cheetham (CEO, Azavea), "jillions of bytes of data bagged and tagged today." Or in the words of Data Refuge's Co-ordinator, Laurie Allen, Assistant Director for Digital Scholarship at Penn Libraries:

"The folks who were downloading got 17 bags (bags = all of the various files made available through a page that a web harvester can’t access – they are often really hard to get). Of those 17, the first 8 are up in with light metadata. The next 9 will be up in the next couple of days. Those 17 bags combined are about 24 gigs, and another person got  ~1.5 terabytes on her own (she’s very awesome). That one will need some special attention."

The Toolbuilders, led by Guides including Toronto-based civic tech developer Brendan O’Brien, worked together with the Baggers to develop ways to get hard-to-grab data. Employees of Philly-based Azavea and members of the Philly Open Data community were key team members.

The Metadata team worked closely with the Baggers and Toolbuilders to develop a workflow for getting datasets from their original websites into in a way that ensures a documented chain of custody and provides enough descriptive information about the files to make them identifiable.
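As an illustration of what a record combining descriptive information with a chain of custody might hold, here is a minimal sketch. It is not the Metadata team's actual schema: every class name, field name, and example value below is a hypothetical chosen for this example.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CustodyEvent:
    actor: str      # who handled the dataset at this step (hypothetical field)
    action: str     # e.g. "downloaded", "checked", "described"
    timestamp: str  # ISO 8601 time in UTC


@dataclass
class DatasetRecord:
    title: str
    source_url: str  # page the files were originally published on
    agency: str
    sha256: str      # checksum of the packaged files
    events: list[CustodyEvent] = field(default_factory=list)

    def log(self, actor: str, action: str) -> None:
        """Append one custody step; earlier steps are left untouched."""
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.events.append(CustodyEvent(actor, action, now))

    def to_json(self) -> str:
        """Serialize the record, including its full event history."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; none of these describe a real dataset.
record = DatasetRecord(
    title="Example buoy observations",
    source_url="https://example.gov/data/buoys",
    agency="NOAA",
    sha256="0" * 64,
)
record.log("bagger-01", "downloaded")
record.log("metadata-02", "described")
```

Because steps are only ever appended, the list of events reads as a history of who touched the files and when, while the descriptive fields keep the dataset identifiable once it is separated from its original web page.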

The Documentarists and Storytellers team, led by Guides including Naomi Waltham-Smith, included ten Penn faculty and students in the Environmental Humanities Program. They live-tweeted the event, developed a series of rescuer portraits, produced “Field Notes,” and took photographs that have been picked up in publications including Wired Magazine. Outgoing Presidential Innovation Fellow Denice Ross, an open data policy expert, was also a member of this group and is developing a project on use cases (more below). EDGI collaborator Rebecca Lave (Indiana University) also worked in this group, in collaboration with the Seeders, to further her work on a big project to monitor and track websites and to write reports about changes to pages and their implications (more below).

The Long Trail explored how Data Refuge can pivot post-rescue events to advocate for climate and environmental research and to plan additional public engagement projects.

Breaking News after #DataRescuePhilly

On 1/17/17, reported that the transition team was ready “to scrub some climate data from the EPA site.” We began a social media campaign to draw attention to this report.  

The following day, Michael Halpern (Union of Concerned Scientists) wrote to us to say, “We are hearing that many non-regulatory federal climate web pages will come down as early as Friday. Sites such as this. … I’m not talking about data here, but about the websites.” Such fears were again reported on 1/18/17 by the New York Times.

As news spread that websites might be in immediate jeopardy, Data Refuge, EDGI, and Internet Archive worked through the nights to seed the outstanding URLs we know about and develop more efficient ways to get them into IA. Data Refuge now has a special account to feed IA directly and is actively doing so.

Additionally, Michael Riedijk (CEO, volunteered to have his firm archive the complete domains list developed by EDGI in need of tracking and monitoring. In private correspondence with Data Refuge + EDGI, Riedijk said he will “run and store it [the list] from our European data center.”

And, we have developed a set of protocols to manage workflow for the long list of Uncrawlables identified but not yet downloaded and put into

DataRescues Expand Data Refuge

Three more data rescue events are planned for the week leading up to the inauguration, with an additional two scheduled, and at least one more (in Cambridge/Boston) in the planning stages.

  • Chicago hosted its data rescue event on 1/17. It was organized by Karl Blumenthal (Internet Archive). You can learn more about their plans.

  • DataRescueIndy: 1/19, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into, has been organized by Jason M. Kelly.

  • Los Angeles: 1/20, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into, has been organized by Morgan Currie and others, including Mike Hucka, who helped out with DataRescuePhilly.

  • Ann Arbor: 1/27-28, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into, has been organized by Justin Schell, who helped out with DataRescuePhilly, and others.

  • New York: 2/4, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into, is being organized by Jerome Whittington (NYU/EDGI) and others.