Data Refuge Updates

Data Refuge and #DataRescuePhilly


What is Data Refuge?

In conversation with many partners, such as you, we can build refuge for federal climate and environmental data vulnerable under an administration that denies the fact of ongoing climate change. We are committed to fact-based arguments. Data Refuge works to preserve the facts we all need.

How are We Building Data Refuge?

Data Refuge makes and preserves research-quality copies of federal climate and environmental data to keep them easily accessible to the research community and public. Data Refuge also promotes data literacy and teaches about the inherent instability of internet content. And, it fosters environmental and climate literacy, connecting data and the people and communities who produce and use data--or wish they had better data. Our work is based on a core best practice in data management: Lots of Copies Keep Stuff Safe (LOCKSS). This practice is reflected in our logo: the EPA logo padlocked.

Who Manages Data Refuge?

The project was launched by the Penn Program in the Environmental Humanities in November 2016 and is co-organized by Laurie Allen (Penn Libraries) and Bethany Wiggin (Program in Environmental Humanities) with assistance from Margaret Janz (Penn Libraries) and Patricia Kim (Program in Environmental Humanities). Data Refuge is built by the Penn Libraries and the Program in Environmental Humanities--with a lot of help and wonderful collaborations with partners at many other libraries and universities across North America.

How much Data Refuge has been built?

In the weeks since late November 2016 when we launched Data Refuge, the project has worked successfully to:

  • Spawn six Data Rescues (Philadelphia, Chicago, Indianapolis, Los Angeles, Ann Arbor, New York) after the initial data rescue in Toronto in mid-December 2016. These rescues concentrate on 1) feeding the Internet Archive (IA), now also through a special account IA opened for Data Refuge; and 2) downloading and describing “uncrawlables,” data that cannot easily go into the Internet Archive and so go into www.datarefuge.org

  • Generate public attention to the vulnerability of climate and environmental data in major national and international media outlets as well as in many smaller ones. A partial bibliography is available here. (On 1/19, Data Refuge aired in a feature on HBO’s Vice News Tonight.)
  • Build and support research collaboratives, including the Environmental Data Governance Initiative (EDGI) 

  • Feed (as of 1/18/2017) 7229 web sites into the Internet Archive

  • Capture (as of 1/18/2017) 1.5+ terabytes of uncrawlables; identify 1259 suspected uncrawlable data sets.

  • Design protocols to allow workflows to continue securely and with quality assurance even as collaborators are not in the same room or working at the same time

  • Participate in a project to track changes to federal environmental and climate websites, in partnership at Penn with the Price Lab for Digital Humanities

  • Prompt exploratory conversations among university research libraries and institutes to create a repository for copies of federal research data. Unlike the federal depository system, which archives materials as they are pushed to it, this proposed repository would actively pull data from federal sites

  • Incubate projects to identify use cases of open climate and environmental data--or of that data's absence--and to tell their users’ stories

  • Spark course development at the intersection of environmental humanities and digital humanities, including courses to support the new Minor and Graduate Certificate in digital humanities at Penn

What happened at Data Rescue Philly in Van Pelt Library 1/13-1/14?

With participation from collaborators at Penn, in Philadelphia, and from American and Canadian universities and research libraries, we hosted more than 250 people over the two days’ events in Van Pelt Library: a teach-in, a Guides training session, a roundtable on climate and data value and vulnerability, an art installation (up until 2/9/17 in the Annenberg Center for Performing Arts) about the variety of data we need to understand the past, present, and possible futures of the Schuylkill River, a day-long archive-a-thon, and two receptions for guests.

 

One month earlier, Environmental Humanities Fellows, Kevin Burke and Patricia Kim, attended the first Data Rescue event (12/17/2016) organized by Prof. Michelle Murphy at the University of Toronto. The Toronto rescue concentrated on identifying content on the EPA website for archiving in the Internet Archive (IA) by the End of Term (EoT) Harvest. The Philadelphia Data Rescue continued and expanded Toronto’s collaboration with EoT and the IA. We also developed know-how, protocols, tools, and knowledge communities to feed the IA--this time with a focus on the NOAA website--and opened www.datarefuge.org for the datasets and other web content that is not machine crawlable. Before the event, we developed a public survey--still circulating widely, thanks to help from the Union of Concerned Scientists--to understand the value and vulnerability of various data sets. If you haven't taken it, please do.

The Data Rescue events also included paths to document and develop the refuge and to conceive stories and visual materials about it, its data, "data rescuers," and the people and communities who use the data or might wish they had good data. Several projects combining tech know-how with documentation and storytelling skills are now underway with time horizons of the very near future, the first 100 days of the new administration, and the following months and years.

Some 130+ volunteers used our Refuge Field Guide to select one of six paths which Guides led them on. These paths were: Seeders, Baggers, Toolbuilders, Metadata, Storytellers & Documentation, and The Long Trail, which focused on thinking about the future of DataRefuge.

The “seeders,” led by an ace trio of Guides, got through 3,692 NOAA websites on Saturday. These Guides--Maya Anjur-Dietrich, Andrew Bergman, and Toly Rinberg (all Ph.D. Candidates in Applied Physics at Harvard University)--belong to the consortium of researchers in the Environmental Data Governance Initiative (Wiggin is a member of EDGI’s steering committee). The group of roughly 40 seeders used agency primers and subprimers written by the trio and a Chrome extension developed initially by EDGI member Matt Price (Toronto). The workflow for seeders is available via EDGI here. Penn Libraries' account with Internet Archive expedites seeding the Internet Archive.

DataRescuePhilly focused on uncrawlable data and taught participants how to begin to identify what was crawlable or not. As this tutorial poster clarifies, crawlables could go into the Internet Archive; uncrawlables into www.datarefuge.org.

The "baggers," led by Justin Schell (University of Michigan Libraries), captured a lot of NOAA data--in the words tweeted out by event participant and roundtable panelist Robert Cheetham (CEO, Azavea), "jillions of bytes of data bagged and tagged today." Or in the words of Data Refuge's Co-ordinator, Laurie Allen, Assistant Director for Digital Scholarship at Penn Libraries:

"The folks who were downloading got 17 bags (bags = all of the various files made available through a page that a web harvester can’t access – they are often really hard to get). Of those 17, the first 8 are up in datarefuge.org with light metadata. The next 9 will be up in the next couple of days. Those 17 bags combined are about 24 gigs, and another person got  ~1.5 terabytes on her own (she’s very awesome). That one will need some special attention."

The Toolbuilders, led by Guides including Toronto-based civic tech developer Brendan O’Brien, worked together with the Baggers to develop ways to get hard-to-grab data. Employees of Philly-based Azavea and members of the Philly Open Data community were key team members.

The Metadata team worked closely with the Baggers and Toolbuilders to develop a workflow for getting datasets from their original websites into datarefuge.org in a way that ensures a documented chain of custody and provides enough descriptive information about the files to make them identifiable.
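To make the idea of a documented chain of custody concrete, here is a minimal sketch in Python. It is not the workflow the Metadata team actually used; the function names (`build_manifest`, `verify_manifest`) and the manifest fields (`source_url`, `collected_by`, `collected_at`) are assumptions made up for illustration. The sketch records where a set of files came from, who collected them and when, plus a checksum for each file, so that a later copy can be checked against the original.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, source_url: str, collected_by: str) -> dict:
    """Describe every file under data_dir and where the collection came from.

    The field names are hypothetical, chosen only for this illustration.
    """
    files = [
        {
            "path": p.relative_to(data_dir).as_posix(),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    return {
        "source_url": source_url,
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def verify_manifest(data_dir: Path, manifest: dict) -> list:
    """Return the paths that are missing or whose checksum has changed."""
    return [
        entry["path"]
        for entry in manifest["files"]
        if not (data_dir / entry["path"]).is_file()
        or sha256_of(data_dir / entry["path"]) != entry["sha256"]
    ]
```

A volunteer could save the result with `json.dumps(manifest, indent=2)` alongside the downloaded files; anyone who later receives a copy can run `verify_manifest` to confirm that nothing was altered or lost in transit.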

The Documentarists and Storytellers team, led by Guides including Naomi Waltham-Smith, included participation from ten Penn faculty and students in the Environmental Humanities Program. They live-tweeted the event, developed a series of rescuer portraits, produced “Field Notes,” and took photographs that have been picked up in publications including Wired Magazine. Outgoing Presidential Innovation Fellow Denice Ross, an open data policy expert, was also a member of this group and is developing a project on use cases (more below). EDGI collaborator Rebecca Lave (Indiana University) also worked in this group, in collaboration with the Seeders, to further her work on a big project to monitor and track websites and to write reports about changes to pages and their implications (more below).

The Long Trail explored how Data Refuge can pivot post-rescue events to advocate for climate and environmental research and to plan additional public engagement projects.

Breaking News after #DataRescuePhilly

On 1/17/17,  InsideEpa.com reported that the transition team was ready “to scrub some climate data from the EPA site.” We began a social media campaign to draw attention to this report.  

The following day, Michael Halpern (Union of Concerned Scientists) wrote to us to say, “We are hearing that many non-regulatory federal climate web pages will come down as early as Friday. Sites such as this. … I’m not talking about data here, but about the websites.” Such fears were again reported on 1/18/17 by the New York Times.

As news spread that websites might be in immediate jeopardy, Data Refuge, EDGI, and Internet Archive worked through the nights to seed the outstanding URLs we know about and develop more efficient ways to get them into IA. Data Refuge now has a special account to feed IA directly and is actively doing so.

Additionally, Michael Riedijk (CEO, Pagefreezer.com) volunteered to have his firm archive the complete domains list developed by EDGI in need of tracking and monitoring. In private correspondence with Data Refuge + EDGI, Riedijk said he will “run and store it [the list] from our European data center.”

And, we have developed a set of protocols to manage workflow for the long list of Uncrawlables identified but not yet downloaded and put into www.datarefuge.org.

DataRescues Expand Data Refuge

Three more data rescue events are planned for the week leading up to the inauguration, with an additional two scheduled, and at least one more (in Cambridge/Boston) in the planning stages:

  • Chicago hosted their data rescue event on 1/17. It was organized by Karl Blumenthal (Internet Archive). You can learn more about their plans.

  • Indianapolis: 1/19, DataRescueIndy, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into DataRefuge.org, has been organized by Jason M. Kelly.

  • Los Angeles: 1/20, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into DataRefuge.org, has been organized by Morgan Currie and others, including Mike Hucka, who helped out with DataRescuePhilly.

  • Ann Arbor: 1/27-28, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into DataRefuge.org, has been organized by Justin Schell, who helped out with DataRescuePhilly, and others

  • New York: 2/4, seeding the Internet Archive with the End of Term Harvest Project and bringing uncrawlables into DataRefuge.org, is being organized by Jerome Whitington (NYU/EDGI) and others






 

 

 


DataRescue Philly Builds DataRefuge

Over the course of the two-day DataRescue Philly event, 250+ people attended. We are very grateful for so many motivated, determined, and--above all--generous volunteers and collaborators. Thanks to you all.

The seeders and sorters (explained below)--led by Data Refuge Guides Maya Anjur-Dietrich, Andrew Bergman, and Toly Rinberg--got through 3,692 NOAA websites on Saturday. The "baggers," led by Justin Schell, captured a lot of NOAA data--in the words tweeted out by event participant Robert Cheetham (CEO Azavea), "jillions of bytes of data bagged and tagged today." Or in the words of Data Refuge's Co-ordinator, Laurie Allen, Assistant Director for Digital Scholarship at Penn Libraries:

The folks who were downloading got 17 bags (bags = all of the various files made available through a page that a web harvester can’t access – they are often really hard to get). Of those 17, the first 8 are up in datarefuge.org with light metadata. The next 9 will be up in the next couple of days. Those 17 bags combined are about 24 gigs, and another person got  ~1.5 terabytes on her own (she’s very awesome). That one will need some special attention.

A diverse group of participants from various backgrounds and with different skills came together over the course of two days to contribute to the project. We had a full house for the kick-off Teach-in, double the number of Guides we expected (we ran out of our Guides tee-shirts!), and the panel discussion to close day one drew a storm of questions. For six portraits of participants on day 2, check these blog posts (Part 1 and Part 2) by Program Fellow Kaushik Ramu with photography by Faculty Working Group Member Naomi Waltham-Smith, Guide for the documentation group. Program Coordinator Patricia Kim aggregated the many tweets posted throughout the events in a series of four "Field Notes" (1, 2, 3, 4 on the Fellows Blog).

Maya Anjur-Dietrich, guiding participants on seeding and sorting NOAA sites. Photography by Naomi Waltham-Smith. 

For the second day of DataRescue Philly, we asked participants to choose one of six paths into the Refuge, each led by one or more Guides trained on the first day. (We trained almost 50 Guides!) These paths were: 

  • the Seeders/Sorters: who nominated URLs to seed the End of Term Harvest project so that these sites would be machine crawled and put into the Internet Archive, and who sorted out the data (pages, datasets, query tools, etc.) which can't be machine crawled and must be captured by other means.
  • the "Baggers:" who captured data that couldn't go easily or at all into IA, figured out a workflow (check it out here), and devised ways to get these ornery and often very large materials on a case-by-case basis. Once they had it, they "bagged" it so it could then be moved into our CKAN instance, the data refuge. Check out the datasets--from NOAA, the Department of Energy, EPA--already in the refuge.
  • the "Tool Builders:" who helped the baggers with especially tricky captures
  • the "Metadata" team of archivists and librarians: who worked with the baggers and tool builders to describe those data sets  
  • the Documentation and Storytellers: who (as quoted above) wrote and visually documented stories from the Data Refuge. Several longer-term individual and collaborative storytelling projects were devised, including one with former White House Presidential Fellow Denice Ross, a leader in the open data initiative. It will cast storylines between data and the many and diverse people and communities who use them. 
  • the Long Trail: who thought together, also in conversation with the other groups, about how to grow data refuge after the "rescue" events: regular meet-ups to continue the work of seeding and sorting into the spring, projects with our collaborators in EDGI (Environmental Data Governance Initiative) to track changes in the websites of several federal agencies and prepare 100-day reports. 
Participants Sean Fern and Matt Zumwalt. Photography by Naomi Waltham-Smith. 

NOAA was indeed the focus, especially of the Seeders and Sorters, and we tackled it using the results of the survey we circulated via the Union of Concerned Scientists and by using the agency primers and subprimers developed by our project partners from EDGI: Rinberg, Bergman, and Anjur-Dietrich (Guides of the Seeders and Sorters). We are eager to carry on this work with the team who worked so quickly.

From Toronto, we hosted Michelle Murphy, lead organizer of the "guerrilla archiving" event there in mid-December 2016, and a co-founder of EDGI. She spoke on the Data Value and Vulnerability roundtable on day one. The panel also included Jefferson Bailey (Director, Internet Archive), Robert Cheetham (President/CEO, Azavea), Michael Halpern (Deputy Director, Union of Concerned Scientists), and Sarah Wu (Deputy Director for Planning, Office of Sustainability, City of Philadelphia). The event was video recorded and will be publicly distributed once it has been lightly edited with some titling.

We were very happy to host organizers or others assisting with several upcoming #datarescue events: Rebecca Lave (who will help out with #DataRescueIndy, organized by Jason Kelly), Mike Hucka (who will help out with Los Angeles), Jerome Whitington (New York), and others.

January 17, 2017 Chicago: #DataRescueChicago

January 19, 2017 Indianapolis: #DataRescueIndy

January 20, 2017 Los Angeles: #ProtectClimateData

Keep checking back for more information on other #DataRescue events and about how to expand Data Refuge. 

This coming Tuesday morning (January 17), at the weekly PPEH Fellows colloquium, a group will re-convene in Penn Libraries to continue on the Long Trail in consultation with our many partners and collaborators.
 

Photo Credit: Andrew Bergman. Co-organizers of DataRescue Philly after the Penn Libraries closed for the night: Margaret Janz, Patricia Kim, Laurie Allen, and Bethany Wiggin. Allen (Penn Libraries) and Wiggin (PPEH) co-ordinate Data Refuge. Good collaborators build DataRefuge.  

Meet the DataRescuers - Part 2

And.. it's about halfway through the day; snow's expected to start driftin' about anytime. Our DJ hasn't made it, but life goes on, and here are three random picks from the toolbuilders' huddle.

Max Furman

Max lives in Philly and works as a software programmer for a consulting firm. He's here because "all data is politicised", and because he wants "facts to be a weapon for the scientific community". He's working today on Chrome extensions that make downloading easy, and has just got hold of a compression utility tool, 'Keka'. His idea of the good life: a beach with table service and a copy of his last read, Invisible Planets, a science-fiction anthology. MOOD: "pensive".

Ariel Rodriguez

Ariel grew up in Cuba and works as a web developer in Miami. He's keen on salvaging data related to global warming, to all that governments do, and spent last evening in New York at the opera. He's wondering how all the websites crawled today will be used, to whom it might be useful in the future, and how best it can all be organized. MOOD: "very good".

 

Robin Schaufler

Robin's from the 'burbs of Philly, but has spent many years in Long Island, NY and Silicon Valley. She's a programmer, and thinks precarious data could be conceived along axes such as risk, criticality for earth-scientific research, and so on. She's worried about gaps in climate-data, and data on ice-cores from the Arctic and Antarctic. A good day, for Robin, is simply a productive day at work, and she clearly loves programming at its intersection with non-solely-for-profit work. MOOD: "focussed"

 

(Photographs by Kaushik Ramu)

Meet the DataRescuers - Part 1

KAUSHIK RAMU AND NAOMI WALTHAM-SMITH

Two hours into Day 2 of #datarefuge. Who are the people, #datarescuers, who've so generously volunteered for this project of archival? Here's a byte each from some seeders and sorters:

Bhairevi Aiyer

Bhairevi is a Masters student in the Environmental Science program at Penn. She grew up in Hyderabad, India, speaks Tamil, and listens to North Indian classical music.  She's concerned about environmental data, and data related to waste-management, but also about oral traditions and literary sources that might have scientific value.  Team: Seeding & Sorting. MOOD: "motivated".

 

 

 

 

Charles Haas

Charles is a Professor of Environmental Engineering at Drexel University. Grew up in New York City and the Midwest, and is here because data's needed for sound decision-making -- but he also has in mind other kinds of loss, such as those of history's burning libraries, as in Alexandria, and imagines monasteries that might have tucked papers away in secret... Especially cares about data related to water-pollution and water-quality. Team: Seeding & Sorting. MOOD: "apprehensive".

 

 

 

 

Ben Kim

Ben is a Penn undergrad from the Computer Science department. Grew up in San Jose, California, and is here because he cares about environmental data that bears on sustainability. Cares about the oceans, especially the depths of the ocean and all that we don't know about them. Team: Seeding & Sorting. MOOD: "motivated"

 

 

 

 

 

 

 

 

(Photographs by Naomi Waltham-Smith)