Crowdsourcing archiving

Today I discovered two things that have been around for a while but which are new to me.

Every now and again I’ve lent my computers’ spare cycles to projects like the Great Internet Mersenne Prime Search and SETI@home, both of which were crowdsourcing scientific computing long before the term “crowdsourcing” became popular.  One of my discoveries today was a project that’s directly related to my professional interests: distributed archiving of websites that are about to go dark.

It all started when this came across my Twitter feed:

@textfiles Yes, you read right, Yahoo! is completely rate-limiting/temp-banning us from making copies of this data they're deleting. ZERG RUSH NEEDED

A Zerg rush on Yahoo?  Say what?  I had visited textfiles.com, an archive of hacker lore, in the past and knew that Jason Scott did interesting things, but had no idea what he was up to now.

It didn’t take much poking around to figure out what’s up.  Yahoo has announced that their Message Boards service is being discontinued at the end of the month.  Of course, there’s no lack of options for places on the web for folks to talk, although I wouldn’t be surprised to hear that there are a few niche communities on the boards that will have to scramble to find a new home.  What can’t be replaced, of course, are the past discussions — and those were made by the users of the service, not by Yahoo.  So far, it doesn’t sound like Yahoo is interested in providing an archive.

That’s where the Archive Team comes in.  From their homepage:

Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. Since 2009 this variant force of nature has caught wind of shutdowns, shutoffs, mergers, and plain old deletions – and done our best to save the history before it’s lost forever.

Sometimes they’ve been able to save the content of a service that’s going dark just by asking for a copy.  Often, however, it has been necessary to crawl the website before the clock runs out.

That’s where the crowdsourcing comes in: by downloading a virtual machine, you too can have your computer become a “Warrior” and use some of its bandwidth to crawl dying websites, then send the data back to the Archive Team’s archive.  From there, the data gets collated and sent to a variety of places, including the Internet Archive.

This is not necessarily polite archiving.  In the name of getting as complete a capture as possible, the archiving appliance intentionally ignores the robots exclusion protocol that normal web crawlers should follow.  Furthermore, having a crowd of Warriors increases the chance that the archiving will progress even in the face of rate-limiting, which Yahoo is currently applying to individual computers that download too quickly.
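For anyone who hasn’t run into the robots exclusion protocol, here is a minimal sketch of the check a polite crawler makes before fetching a page, using Python’s standard urllib.robotparser module.  The robots.txt contents, the example.com URLs, and the crawler name are all made up for illustration; this is not the Warrior’s code, just a picture of the step that it skips.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (not Yahoo's actual file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /boards/
"""


def polite_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example-crawler" is an invented user-agent name.
    for url in (
        "https://example.com/boards/thread/1",
        "https://example.com/about",
    ):
        allowed = polite_may_fetch(EXAMPLE_ROBOTS_TXT, "example-crawler", url)
        print(url, "allowed" if allowed else "disallowed")
```

A well-behaved crawler would skip anything reported as disallowed.  An archiving run that ignores the protocol simply never asks the question.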

Does this sound messy?  Sure.  Would a cautious institution want to think twice before running a Warrior?  Perhaps.  The cause is worthy, but the potential for liability is uncertain if a website operator decides to call an archiving effort a distributed denial-of-service attack.

Is it necessary?  I believe that it is, so I’m running a Warrior.

The virtual machine, which runs on top of VirtualBox or the like, is dead simple to use, and you can control which projects the Warrior will participate in.  Besides Yahoo Messages, the Archive Team is also currently archiving the blogging service Posterous, which is due to go dark at the end of April.

Since Yahoo Messages is going dark less than nine days from now, I encourage folks to consider pitching in now.  Think of it as the WOZ corollary to LOCKSS: Waves of Zergs create the archive.  Then we can have the stuff for Lots of Copies Keep Stuff Safe.

The other discovery I made today?  Just Google for “zerg rush” and wait a moment.

CC BY-SA 4.0 Crowdsourcing archiving by Galen Charlton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.