{"id":464,"date":"2013-03-22T22:57:30","date_gmt":"2013-03-23T05:57:30","guid":{"rendered":"http:\/\/galencharlton.com\/blog\/?p=464"},"modified":"2013-03-22T22:57:30","modified_gmt":"2013-03-23T05:57:30","slug":"crowdsourcing-archiving","status":"publish","type":"post","link":"https:\/\/galencharlton.com\/blog\/2013\/03\/crowdsourcing-archiving\/","title":{"rendered":"Crowdsourcing archiving"},"content":{"rendered":"<p>Today I discovered two things that have been around for a while but which are new to me.<\/p>\n<p>Every now and again I&#8217;ve lent my computers&#8217; spare cycles to projects like the <a href=\"http:\/\/www.mersenne.org\/\">Great Internet Mersenne Prime Search<\/a> and <a href=\"http:\/\/setiathome.berkeley.edu\/\">SETI@home<\/a>, both of which were crowdsourcing scientific computing long before the term &#8220;crowdsourcing&#8221; became popular. \u00a0One of my discoveries today was a project that&#8217;s directly related to my professional interests: distributed archiving of websites that are about to go dark.<\/p>\n<p>It all started when this came across my Twitter feed:<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2.png\"><img loading=\"lazy\" class=\"aligncenter  wp-image-470\" title=\"@textfiles Yes, you read right, Yahoo! is completely rate-limiting\/temp-banning us from making copies of this data they're deleting. ZERG RUSH NEEDED\" alt=\"@textfiles Yes, you read right, Yahoo! is completely rate-limiting\/temp-banning us from making copies of this data they're deleting. 
ZERG RUSH NEEDED\" src=\"https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2-1024x256.png\" width=\"482\" height=\"120\" srcset=\"https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2-1024x256.png 1024w, https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2-300x75.png 300w, https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2-624x156.png 624w, https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/zerg2.png 1038w\" sizes=\"(max-width: 482px) 100vw, 482px\" \/><\/a><\/p>\n<p>A Zerg rush on Yahoo? \u00a0Say what? \u00a0I had visited <a href=\"http:\/\/textfiles.com\">textfiles.com<\/a>, an archive of hacker lore, in the past and knew that Jason Scott did interesting things, but had no idea what he was up to now.<\/p>\n<p>It didn&#8217;t take much poking around to figure out what was up. \u00a0Yahoo has announced that their <a href=\"http:\/\/messages.yahoo.com\/\">Message Boards<\/a> service is being discontinued at the end of the month. \u00a0Of course, there&#8217;s no lack of places on the web for folks to talk, although I wouldn&#8217;t be surprised to hear that there are a few niche communities on the boards that will have to scramble to find a new home. \u00a0What can&#8217;t be replaced, of course, are the past discussions &#8212; and those were made by the users of the service, not by Yahoo. \u00a0So far, it doesn&#8217;t sound like Yahoo is interested in providing an archive.<\/p>\n<p>That&#8217;s where the <a href=\"http:\/\/archiveteam.org\/index.php?title=Main_Page\">Archive Team<\/a> comes in. \u00a0From their homepage:<\/p>\n<blockquote><p>Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. 
Since 2009 this variant force of nature has caught wind of shutdowns, shutoffs, mergers, and plain old deletions &#8211; and done our best to save the history before it&#8217;s lost forever.<\/p><\/blockquote>\n<p>Sometimes they&#8217;ve been able to save the content of a service that&#8217;s going dark just by asking for a copy. \u00a0Often, however, it has been necessary to crawl the website before the clock runs out.<\/p>\n<p>That&#8217;s where the crowdsourcing comes in: by downloading a virtual machine, you too can have your computer become a &#8220;<a href=\"http:\/\/archiveteam.org\/index.php?title=ArchiveTeam_Warrior\">Warrior<\/a>&#8221; and use some of its bandwidth to crawl dying websites, then send the data back to the Archive Team&#8217;s archive. \u00a0From there, the data gets collated and sent to a variety of places, including the Internet Archive.<\/p>\n<p>This is not necessarily <em>polite<\/em> archiving. \u00a0In the name of getting as complete a capture as possible, the archiving appliance intentionally <a href=\"http:\/\/www.archiveteam.org\/index.php?title=Robots.txt\">ignores<\/a> the robots exclusion protocol that normal web crawlers should follow. \u00a0Furthermore, having a crowd of Warriors increases the chance that the archiving will progress even in the face of rate-limiting, which Yahoo is currently applying to individual computers that download too quickly.<\/p>\n<p>Does this sound messy? \u00a0Sure. \u00a0Would a cautious institution want to think twice before running a Warrior? Perhaps &#8212; the cause is worthy, but the potential for liability is uncertain if a website operator decided to call an archiving effort a distributed denial-of-service attack.<\/p>\n<p>Is it necessary? \u00a0I believe that it is, so I&#8217;m running a Warrior.<\/p>\n<p>The virtual machine, which runs on top of VirtualBox or the like, is dead simple to use, and you can control which projects the Warrior will participate in. 
\u00a0Besides Yahoo Message Boards, the Archive Team is also currently archiving the blogging service Posterous, which is due to go dark at the end of April.<\/p>\n<p>Since Yahoo Message Boards is going dark less than nine days from now, I encourage folks to consider pitching in <em>now<\/em>. \u00a0Think of it as the WOZ\u00a0corollary to LOCKSS: Waves of Zergs create the archive. \u00a0<em>Then<\/em> we can have the stuff for Lots of Copies Keep Stuff Safe.<\/p>\n<p>The other discovery I made today? \u00a0Just Google for &#8220;zerg rush&#8221; and wait a moment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today I discovered two things that have been around for a while but which are new to me. Every now and again I&#8217;ve lent my&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"Crowdsourcing archiving http:\/\/wp.me\/p3gJ9y-7u @archiveteam #FeedTheLOCKSS","jetpack_is_tweetstorm":false},"categories":[6],"tags":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3gJ9y-7u","_links":{"self":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/464"}],"collection":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/comments?post=464"}],"version-history":[{"count":22,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/464\/revisions"}],"predecessor-version":[{"id":488,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/464\/revisions\/488"}],"wp:attachment":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/media?parent=464"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/categories?post=464"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/tags?post=464"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}