The End of Term Web Archive is a project to preserve public government websites and data that are at risk of being removed during the transition from one US administration to another. The Federal government has produced a great many websites and resources, and the process of archiving them takes weeks and months. The goal of current data rescue efforts is to identify the most urgent cases so that they get archived sooner. "Urgent cases" are those that the incoming administration may be particularly antagonistic towards.
While we want as much as possible to go into the Internet Archive, some of the open data resources made available by different government agencies are in formats that the Internet Archive is not designed to handle. These data need to be downloaded separately, packaged in a container along with a description of what the data are, and uploaded to other open repositories designed specifically for data archiving.
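As a rough illustration of the "package it with a description" step, here is a minimal sketch. It assumes a zip file as the container and a `metadata.json` file as the description; the function name `package_dataset`, the metadata fields, and the example URL are all hypothetical, and the repositories you upload to may require a different layout.

```python
import hashlib
import io
import json
import zipfile


def package_dataset(files, description, source_url, out):
    """Write `files` (name -> bytes) plus a metadata.json record into a zip at `out`.

    `out` may be a file path or a writable binary file object.
    The metadata layout is an assumption made for this sketch.
    """
    metadata = {
        "description": description,
        "source_url": source_url,
        "files": [
            {
                "name": name,
                "bytes": len(data),
                # A checksum lets later users confirm the copy is intact.
                "sha256": hashlib.sha256(data).hexdigest(),
            }
            for name, data in sorted(files.items())
        ],
    }
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(files.items()):
            archive.writestr("data/" + name, data)
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return metadata


# Usage with made-up data, written to memory instead of disk.
buffer = io.BytesIO()
meta = package_dataset(
    {"readings.csv": b"year,value\n2016,42\n"},
    "Hypothetical example dataset",
    "https://example.gov/data/readings.csv",
    buffer,
)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
print(names)
```

Recording where the data came from and a checksum for each file alongside the data itself means the package can still be understood and verified after it leaves the original website.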
A webcrawler is a program that visits a web page, stores a copy, then examines the page looking for links to other web pages, follows those links, and repeats the process for every new page found. It is useful to understand a little bit about how crawlers work and what their limitations are.
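The loop described above can be sketched in a few lines. This is a simplified illustration, not the crawler the Internet Archive actually runs: the `crawl` function, the `fetch` parameter, and the three-page `FAKE_SITE` are hypothetical, and the "site" lives in memory so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    stored = {}                 # url -> saved copy of the page
    queue = deque([start_url])  # pages still to visit
    seen = {start_url}          # every URL ever queued, to avoid loops
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue            # nothing retrievable at this URL
        stored[url] = html      # "stores a copy"
        parser = LinkExtractor()
        parser.feed(html)       # "examines the page looking for links"
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # "follows those links"
    return stored


# Hypothetical site: the CSV link points at something `fetch` cannot return.
FAKE_SITE = {
    "https://example.gov/": '<a href="/data">Data</a> <a href="/about">About</a>',
    "https://example.gov/data": '<a href="/">Home</a> <a href="/files/report.csv">CSV</a>',
    "https://example.gov/about": "<p>No links here.</p>",
}

pages = crawl("https://example.gov/", FAKE_SITE.get)
print(sorted(pages))
```

The sketch also shows the main limitation: a crawler like this only finds what appears as a plain link in the HTML. Content that is reached through a search form, generated by JavaScript, or served from a database query never enters the queue.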
The Seeders and Sorters team canvasses the resources of a given government agency, identifying important URLs. URLs are nominated (equivalently, "seeded") using the Chrome extension or bookmarklet developed for this purpose. Each URL is added to a spreadsheet.
The reason for a human in the loop is this: the technological limitations of today's webcrawlers mean that not everything can be downloaded automatically, so humans are needed to sort pages by whether they can be completely captured automatically. This sorting is only provisional: when in doubt, seeders mark a URL as not crawlable, and humans in the next step of the workflow (the researchers) take a closer look at the "uncrawlables".
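To give a feel for what seeders look for, the sketch below scans a page's HTML for a few signs that a crawler might miss content. It is only an illustrative aid, not part of the official tooling: the function `review_hints`, the list of signals, and the file extensions are assumptions chosen for this example, and a human still makes the final call.

```python
from html.parser import HTMLParser

# Assumed list of extensions that suggest a downloadable data file.
DATA_EXTENSIONS = (".csv", ".zip", ".xls", ".xlsx", ".json", ".xml")


class SignalScanner(HTMLParser):
    """Records features that often mean a page needs a human look."""

    def __init__(self):
        super().__init__()
        self.signals = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.signals.add("form")      # content may sit behind a search box
        elif tag == "script":
            self.signals.add("script")    # content may be built by JavaScript
        elif tag == "a":
            href = (attrs.get("href") or "").lower()
            path = href.split("?")[0]     # ignore any query string
            if path.endswith(DATA_EXTENSIONS):
                self.signals.add("data file link")


def review_hints(html):
    """Return a sorted list of reasons a page may not be fully crawlable."""
    scanner = SignalScanner()
    scanner.feed(html)
    return sorted(scanner.signals)


# Made-up pages for illustration.
search_page = (
    '<form action="/search"><input name="q"></form>'
    '<a href="/files/gen.CSV?year=2016">Download</a>'
)
plain_page = '<p>Hello</p><a href="/about.html">About</a>'

print(review_hints(search_page))
print(review_hints(plain_page))
```

An empty result does not prove a page is crawlable, and a non-empty one does not prove it is not. Consistent with the rule above, anything doubtful should still be marked as not crawlable and passed to the researchers.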
- Agency forecasts developed by EDGI: we are focusing on DOE for seeding, along with a set of uncrawlable resources identified at past events. In this UCLA event, our seeding and sorting goal will be Department of Energy (DOE) sites. We will prioritize (1) Office of Energy Efficiency and Renewable Energy, (2) Office of Science, (3) Energy Information Administration, (4) Federal Energy Regulatory Commission, and (5) National Renewable Energy Laboratory, in that order.
- Assignment tracking spreadsheet: so that people know which agency sub-pages are already being worked on.
Researchers take a closer look at URLs that seeders and sorters flagged as possibly not crawlable. This activity requires more familiarity with HTML, JavaScript, the types of resources that might be encountered on the web, and how the web works in general.
For many people and events, the researchers and harvesters overlap considerably. For this reason, they work off the same list of URLs. In our event, we are using a subset of uncrawlable URLs that was determined to be higher priority and already confirmed to be uncrawlable. This list of URLs is in a spreadsheet listed in the next step below.
Harvesters take the "uncrawlable" content and try to figure out how to capture it. This is often a complex task that can require substantial technical expertise, and often requires different techniques for different types of content. A good description of the process has been put together by the DataRefuge group at UPenn in their toolkit:
- The uncrawlables work list: this is actually a subset, determined by Philly's DataRefuge group to have high priority.
The steps above are part of a larger workflow still under active development by several groups. The workflow tries to address the different kinds of content and websites that can be encountered. Currently, the clearest articulation of that workflow is the following documentation developed by the UPenn group:
EDGI (Environmental Data & Governance Initiative) is an international effort to develop tools, research networks, and initiatives to archive public environmental data proactively. Together with DataRefuge, an effort from the University of Pennsylvania Program in the Environmental Humanities, EDGI has been organizing data rescue events since December 2016. We are using the workflows and tools that these groups have been developing.