Evergreen Conference 2013 road trip

The next few days promise to be busy.  Tuesday morning, my first stop is Sea-Tac to pick up a couple other conference attendees who are flying in from the East Coast.  After a stop for lunch, it’s a straight shot up I-5 and BC-99 to Vancouver.

Wednesday morning I’ll be bouncing around among the IG and committee meetings, and making a particular point of joining the Web Team and the Cataloging Working Group meetings.  I plan to spend most of the afternoon at the hackfest.

Thursday looks to be mostly sessions, but you may also find me distributing Evergreen t-shirts.

On Friday, I’ll be part of two presentations.  At noon, I’ll be talking about data quality and Evergreen, and at 2:30 I’ll be joining Rogan Hamby and Robin Johnson to talk about how networking affects Evergreen.

Friday morning I’ll be joining the other members of the Evergreen Oversight Board (and, I hope, other interested community members!) for our business meeting.  Later, on Saturday morning, the Oversight Board will give an update to the conference.

And other than that?  I’m looking forward to attending the keynotes and catching some sessions.  But most of all, I’m looking forward to seeing friends old and new.

Update 9 April 2013: The Evergreen Oversight Board meeting was rescheduled to 8 a.m. on Friday.

Call for reviewers for VideoGameCat

Video Game Cats

CC-BY-NC-ND image “Video Game Cats” by jenbooks on Flickr.

VideoGameCat is a website that “aims to be a resource for educators and librarians interested in using games in educational environments, whether in class or as part of a library collection.” It’s been quiet for a while, but it’s now back with a new design.  The review editor (and former classmate of mine), Shannon Farrell of the University of Minnesota, is looking for folks to contribute reviews and guest posts.

I’m a gamer and a library professional, and I also think that games have a place in collection development policies. There are of course plenty of places on the web to go to find video game reviews, but it’s not every review site that has the information needs of a collection development librarian in mind.

If you’ve been thinking of writing up the last game you’ve finished (or perhaps thrown against the wall!) or introduced in your library, please consider checking VideoGameCat’s page for new reviewers and submitting your review.

Sharing is for curmudgeons, too

Dalek egg frontal view

CC-BY photo by Nancy Sims

It’s always neat to find out that somebody whose work you follow in one context has done something interesting in a completely different field.  Nancy Sims, who is an attorney and the Copyright Program Librarian at the University of Minnesota Libraries, writes the Copyright Librarian blog.  As I found out on Friday when I read her post On releasing an image to the wilds…, she also decorates eggs… elaborately.  Take a look at our friend on the right, and if you’re a Whovian like me, take another moment to squee.

She posted her photos of the Dalek egg on Flickr under a Creative Commons Attribution license, and the images went viral.  As you might expect, the photos get a spike in interest, including reblogs, around Easter every year. Sometimes the images get attributed properly, sometimes they do not. And sometimes Sims gets requests for permission to use the photos, including one, amusingly enough, from the BBC.

Of course, one of the points of the Creative Commons licenses is that you don’t have to ask permission to make use of CC-licensed content as long as you follow the terms of the particular variant that the creator applied. As Sims wrote:

[...] I hate it when people ask for permission to use things that already carry a CC-license sufficient to the purpose.

Further, in her response to the BBC request, she says:

I am a big fan of Creative Commons licenses, and would like to see them used more (when appropriate) by everyone!

This circles back to the title of my post: Sharing is for curmudgeons, too.  To be clear, I’m not using the word “curmudgeon” to refer to anybody in particular. You may be a curmudgeon most of the time, never, or only just in the morning before caffeine appears. You may simply want to get stuff done quickly, with the least amount of interaction required.

Free and open licenses are perfect for curmudgeons. Why? No need to ask for permission. Need an image of a pristine lawn for your website? You can grab a CC-BY image, put it up with attribution, and never ask permission. Need to tweak your webserver’s software? If it’s free software, you can just go get the source code and modify it — and never ask for permission.

It works in the other direction. If you’ve written a useful little utility for yourself, you can slap on a free software license and publish it on Gitorious… then forget about it. If somebody else finds it useful, great — and they don’t need to bother you about it if they want to fix a bug or enhance it!

Free software licenses help promote community, which is important for any project that is larger in scope than a single person. If we can all see and talk about the code, we can make it better faster. But free and open licenses also reduce friction, and that’s where the curmudgeons of the world come in: often great things come from somebody working alone in her figurative garage.

Curmudgeons of the world — unite! Or not, it’s your choice.

Crowdsourcing archiving

Today I discovered two things that have been around for a while but which are new to me.

Every now and again I’ve lent my computers’ spare cycles to projects like the Great Internet Mersenne Prime Search and SETI@home, both of which were crowdsourcing scientific computing long before the term “crowdsourcing” became popular.  One of my discoveries today was a project that’s directly related to my professional interests: distributed archiving of websites that are about to go dark.

It all started when this came across my Twitter feed:

@textfiles Yes, you read right, Yahoo! is completely rate-limiting/temp-banning us from making copies of this data they're deleting. ZERG RUSH NEEDED

A Zerg rush on Yahoo?  Say what?  I had visited textfiles.com, an archive of hacker lore, in the past and knew that Jason Scott did interesting things, but had no idea what he was up to now.

It didn’t take much poking around to figure out what’s up.  Yahoo has announced that their Message Boards service is being discontinued at the end of the month.  Of course, there’s no lack of options for places on the web for folks to talk, although I wouldn’t be surprised to hear that there are a few niche communities on the boards that will have to scramble to find a new home.  What can’t be replaced, of course, are the past discussions — and those were made by the users of the service, not by Yahoo.  So far, it doesn’t sound like Yahoo is interested in providing an archive.

That’s where the Archive Team comes in.  From their homepage:

Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. Since 2009 this variant force of nature has caught wind of shutdowns, shutoffs, mergers, and plain old deletions – and done our best to save the history before it’s lost forever.

Sometimes they’ve been able to save the content of a service that’s going dark just by asking for a copy.  Often, however, it has been necessary to crawl the website before the clock runs out.

That’s where the crowdsourcing comes in: by downloading a virtual machine, you too can have your computer become a “Warrior” and use some of its bandwidth to crawl dying websites, then send the data back to the Archive Team’s archive.  From there, the data gets collocated and sent to a variety of places, including the Internet Archive.

This is not necessarily polite archiving.  In the name of getting as complete a capture as possible, the archiving appliance intentionally ignores the robots exclusion protocol that normal web crawlers should follow.  Furthermore, having a crowd of Warriors increases the chance that the archiving will progress even in the face of rate-limiting, which Yahoo is currently applying to individual computers that download too quickly.

Does this sound messy?  Sure.  Would a cautious institution want to think twice before running a Warrior? Perhaps: the cause is worthy, but the potential for liability is uncertain if a website operator decided to call an archiving effort a distributed denial-of-service attack.

Is it necessary?  I believe that it is, so I’m running a Warrior.

The virtual machine, which runs on top of VirtualBox or the like, is dead simple to use, and you can control which projects the Warrior will participate in.  Besides Yahoo Message Boards, the Archive Team is also currently archiving the blogging service Posterous, which is due to go dark at the end of April.

Since Yahoo Message Boards is going dark less than nine days from now, I encourage folks to consider pitching in now.  Think of it as the WOZ corollary to LOCKSS: Waves of Zergs create the archive.  Then we can have the stuff for Lots of Copies Keep Stuff Safe.

The other discovery I made today?  Just Google for “zerg rush” and wait a moment.

Bestiary of Monstrous Data #2: I’m not dead yet!

This is the second part in an occasional series on how good data can go bad.

One aspect of the MARC standard that sometimes is forgotten is that it was meant to be a cataloging communications format. One could design an ILS that doesn’t use anything resembling MARC 21 to store or express bibliographic data, but as long as its internal structure is sufficiently expressive to keep track of the distinctions called for by AACR2, in principle it could relegate MARC handling strictly to import and export functionality. By doing so, it would follow a conception of MARC as a lingua franca for bibliographic software.

In practice, of course, MARC isn’t just a common language for machines; it’s also part of a common language for catalogers.  If you say “222” or “245” or “780” to one, you’ve communicated a reasonably precise (in the context of AACR2) identification of a metadata attribute.  Sure, it’s arcane, but then again so is most professional jargon to non-practitioners.  MARC has also become the basis of record storage and editing in most ILSs, to the point where the act of cataloging is sometimes conflated with the act of creating and editing MARC records.

But MARC’s origins as a communications format can sometimes conflict with its ad hoc role as a storage format.  Consider this record:

00528dam  22001577u 4500
001 123
100 1  $a Strang, Elizabeth Leonard.
245 10 $a Lectures on landscape and gardening design / $c by Elizabeth Leonard Strang.

A brief bibliographic record, right?  Look at the Leader/05, which stores the record status.  The value ‘d’ means that the record is deleted; other values for that position include ‘n’ for new and ‘c’ for corrected.

But unlike, say, the 245, the Leader/05 isn’t making an assertion about a bibliographic entity.  It’s making an assertion about the metadata record itself, and one that requires more context to make sense.  There can’t be a globally valid assertion that a record is deleted; my public library may have deaccessioned Lectures on landscape and gardening design, but your horticultural library may keep that title indefinitely.

Consequently, the Leader/05 is often ignored when creating or modifying records in an ILS.  For example, if a bib record is present in an Evergreen or Koha database, setting its Leader/05 to ‘d’ does not affect its indexing or display.

However, such records can become undead, not in the context of the ILS, but in the context of exporting them for loading into a discovery layer or a union catalog. Some discovery layers do look at the Leader/05.  If an incoming record is marked as deleted, that is taken as a signal to remove the matching record from the discovery layer’s own indexes.  If there is no matching record, the discovery layer would be reasonable to ignore an incoming “deleted” record, and I know of at least one that does exactly that.

The result? A record that appears to be perfectly good in the ILS doesn’t show up in the discovery layer.
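For records that have already been exported, the same check can be sketched in Python. This is only an illustration, not anything that Evergreen or Koha ships: the function names are hypothetical, it assumes you have each record’s 24-character leader available as a plain string keyed by record ID, and it labels only the three status codes mentioned above.

```python
# Labels for the Leader/05 values mentioned in this post (not the full MARC 21 list).
STATUS_LABELS = {"n": "new", "c": "corrected", "d": "deleted"}


def record_status(leader):
    """Return the Leader/05 (record status) character of a MARC 21 leader."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    return leader[5]  # positions are counted from zero


def find_undead(leaders_by_id):
    """Return IDs of records that are live in the ILS but flagged 'd' in the leader.

    leaders_by_id is a hypothetical mapping of record ID -> leader string,
    built from records that the ILS itself does not consider deleted.
    """
    return [
        record_id
        for record_id, leader in leaders_by_id.items()
        if record_status(leader) == "d"
    ]
```

Fed the leader from the example record above, record_status returns ‘d’, so that record would be reported as undead before it ever reaches the discovery layer.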

Context matters.

I’ll finish with a couple SQL queries for finding such undead records, one for Evergreen:

and one for Koha:

 

CC-BY image of a woodcut of a viper courtesy of the Penn Provenance Project.

A pause to reflect

Libraries are sneaky, crafty places.  If you walk into one, things may never look the same when you walk out.

Libraries are dangerous places.  If you open your mind in one, you may be forever changed.

And, more mundanely, university libraries are places that employ a lot of work-study students.  I was one of them at Ganser Library at Millersville University.  Although I’ve always been a bookish lad, when I started as a reference shelver at Ganser I wasn’t thinking of the job as anything more than a way to pay the rent while I pursued a degree in mathematics.  And, of course, there were decided limits to how much fascination I found filing updated pages in a set of the loose-leaf CCH tax codes.  While some of the cases I skimmed were interesting, I can safely say that a career in tax accountancy was not in my future, either then or now.

Did I mention that libraries are crafty?  Naturally, much of the blame for that attaches to the librarians. As time passed, I ended up working in just about every department of the library, from circulation to serials to systems, as if there were a plot to have me learn to love every nook and cranny of that building and the folks who made it live.  By the time I graduated, math degree in hand, I had accepted a job with an ILS vendor, directly on the strength of the work I had done to help the library migrate to the (at the time) hot new ILS.

While writing this post, it has hit me hard what an incredible debt of gratitude I owe to my mentors at Ganser.  To name some of them, Scott Anderson, Krista Higham, Barbara Hunsberger, Sally Levit, Marilyn Parrish, Elaine Pease, Leo Shelley, Marjorie Warmkessel, and David Zubatsky have each taught me much, professionally and personally.  To be counted among them as a member of the library profession is an honor.

Today I have an opportunity to toot my horn a bit, having been named one of the “Movers and Shakers” this year by Library Journal.  I am grateful for the recognition, as well as the opportunity to sneak a penguin into the pages of LJ.

Original image by Larry Ewing

Why a penguin? In part, simply because that’s how my whimsy runs. But there’s also a serious side to my choice, and I’m happy that the photographer and editors ran with it. Tux the penguin is a symbol of the open source Linux project, and moreover is a symbol that the Linux community rallies behind. Why have I emphasized community? Because it’s the strength of the library open source communities, particularly those of the Koha and Evergreen projects, that inspires me from day to day. Not that it’s all sunshine and kittens; any strong community will have its share of disappointments and conflicts. However, I deeply believe that open source software is a necessary part of librarians (I use that term broadly) building their own tools with which to share knowledge (and I use that term very broadly) with the wider communities we serve.

The recognition that LJ has given me for my work for Koha and Evergreen is very flattering, but for me it is at heart an opportunity to reflect, and to thank the many friends and mentors in libraryland I have met over the years.

Thanks, and may the work we share ever continue.

Thanks for the tip

I wasn’t one of the people viscerally affected by Google’s announcement of the forthcoming shutdown of Google Reader, since so far I’ve relied on a combination of standalone RSS clients and antediluvian hit-the-refresh-button-repeatedly habits.  I am dinosaur: hear me roar!

However, the announcement prompted me to take another look at RSS readers.  Nowadays, I rotate among my PC, phone, tablet, and laptop frequently, so going back to a purely standalone reader like NetNewsWire wasn’t appealing.  The online services like NewsBlur would fit my needs better than a standalone reader (and I’m willing, even happy, to pay for the hope of longevity), but suffer from two disadvantages.  They’re not open source, and as a consequence, it’s hard to dig through their guts.  And I like digging!

I first found out about Tiny Tiny RSS from one of Ed Corrado’s tweets, and like a lot of people, visited the website and saw… a whole lotta nothing at first.  Google really ought to give their open source competitors a little more warning of service cancellation announcements so that they can beef up their web hosting in advance!

Fortunately, I persevered and installed it.  Tiny Tiny RSS, which is primarily written and maintained by Andrew Dolgov, is a web-based feed reader that you can install on your web server.  It’s licensed under version 2 of the GPL.  Installation is very simple if you have a VPS, and it looks fairly easy on shared hosting as long as the host provides MySQL or PostgreSQL databases.  Once you have it running, you can easily import an OPML file or manually subscribe to feeds.  The web interface for reading and managing feeds is clean and responsive. It also has an API and there are at least two Android clients. I couldn’t find an iOS client, but I suspect that somebody will scratch that itch soon.

So far, I’m quite happy with it.  Thanks, Ed!  Thanks, Andrew and all of the other contributors!

Return to sender

In his column in American Libraries today, Will Manley makes a good point that librarians should think twice about agreeing to projects that — no matter how useful — don’t add to the library’s mission. In fact, librarians can even say “no” every now and again. Unfortunately, I found that the column has a few too many cheap shots, detracting from Manley’s message.

Manley’s target? A proposal floated by the U.S. Postal Service to offer retail postal services via partner libraries. It’s understandable that the idea should raise eyebrows among librarians. After all, the IRS program to distribute tax forms through libraries has been a perfect example of an unfunded federal mandate from the point of view of libraries that find themselves turning into ad hoc tax advice services every spring. (And as far as I know, nobody’s offering a joint MLS/tax accountancy degree.) While providing tax forms is a useful service, it’s not clear that it’s one that libraries need to be involved in, or that being involved furthers library aims.

Where Manley goes too far is in a series of lazy clichés about the USPS:

After going billions of dollars into debt and being almost aced out of business by the double whammy of email and private-sector carriers that actually deliver your letters and packages on time and in good condition, the USPS is finally thinking outside of the post office box: The agency has hatched the concept of putting post office kiosks in libraries.

Aced out of business by private competition? There’s no doubt that the environment has drastically changed for the USPS, but it doesn’t follow that the shift from letters to email has made it a dinosaur. A (to say the least) challenging oversight structure and uniquely onerous pension funding requirements imposed on the USPS by Congress have handicapped its ability to react. The USPS covers more territory at cheaper rates than postal systems in many other countries.  Also, it covers rural areas that private firms either would not serve at all or only at exorbitant rates.

Suffice it to say, I generally like the USPS — a stint living in Alaska tends to do that to one. The USPS also has a mandate that is very consonant with library values: universal service.

Of course, whether or not the USPS is fairly treated by Manley doesn’t speak to whether a library should agree to start selling stamps and collecting mail. It’s certainly a stretch from traditional services. But a little digging turned up a big difference from the IRS program: it’s not an unfunded mandate. The “Village Post Office” program, as it’s called, does offer compensation to the small businesses (and libraries!) that operate them. For a struggling library in a rural community whose post office has recently closed or reduced hours, starting a VPO could be a net gain.

Indeed, librarians should know how to say “no”. But they also should know to do their due diligence before deciding.

Exploring memcached caches

Both Koha and Evergreen use memcached to cache user sessions and data that would be expensive to continually fetch and refetch from the database. For example, Koha uses memcached to cache MARC frameworks, while Evergreen caches search results, bibliographic added content, search suggestions, and other data.

Even though the data that gets cached is transitory, at times it can be useful to look at it. For example, you may need to check to see if some stale data is present in the cache, or you may want to capture some statistics about user sessions that would otherwise be lost when the cache expires.

The library libMemcached includes several command-line utilities for interrogating a memcached server. We’ll look at memcdump and memccat.

memcdump prints a list of keys that are (or were, since the data may have expired) stored in a memcached server. Here’s an example of the sorts of keys you might see in an Evergreen system:

memcdump --servers 127.0.0.1:11211
oils_AS_21a5dc5cd2aa42ee7c0ecc239dcb25b5
ac.toc.html.0531301990
open-ils.search_9fd0c6c3553e6979fc63aa634a78b362_facets
open-ils.search_9fd0c6c3553e6979fc63aa634a78b362
oils_auth_8682b1017b7b27035576fecbfc7715c4

The --servers 127.0.0.1:11211 bit tells memcdump to check memcached running on the local server.
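If you want to sift through a long key list, a small wrapper script can help. The following is a rough sketch rather than a finished tool: the function names are made up for illustration, and it assumes that memcdump is on your PATH under the Debian name and that each key is printed on its own line, as in the listing above.

```python
import subprocess


def parse_keys(output):
    """Split memcdump output into a list of keys, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def keys_with_prefix(keys, prefix):
    """Keep only the keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


def list_keys(server="127.0.0.1:11211"):
    """Run memcdump against one server and return its keys.

    Assumes the Debian-style command name 'memcdump' is installed.
    """
    result = subprocess.run(
        ["memcdump", "--servers", server],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_keys(result.stdout)
```

For example, keys_with_prefix(list_keys(), "oils_auth_") would narrow the listing above to the single key that begins with oils_auth_.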

A list of keys, however, doesn’t tell you much. To see the value that’s stored under that key, use memccat. Here’s an example of looking at a user session record in Koha (assuming you’ve set the SessionStorage system preference to use memcached):

memccat --servers 127.0.0.1:11211 KOHA78c879b9942dee326710ce8e046acede
---
_SESSION_ATIME: '1363060711'
_SESSION_CTIME: '1363060711'
_SESSION_ID: 78c879b9942dee326710ce8e046acede
_SESSION_REMOTE_ADDR: 192.168.1.16
branch: CPL
branchname: Centerville
cardnumber: cat
emailaddress: ''
firstname: ''
flags: 1
id: cat
ip: 192.168.1.16
lasttime: '1363060711'
number: 51
surname: cat

And here’s an example of an Evergreen user session cached object:

memccat --servers 127.0.0.1:11211 oils_auth_8682b1017b7b27035576fecbfc7715c4
{"authtime":420,"userobj":{"__c":"au","__p":[null,null,null,null,null,null,null,null,null,"186",null,"t",null,"f",119284,38997,0,0,"2011-05-31T11:17:16-0400","0.00","1-888-555-1234","1923-01-01T00:00:00-0500","user@example.org",null,"2015-10-29T00:00:00-0400","User","Test",186,654440,3,null,null,null,"1358890660.7173220299.6945940294",119284,"f",1,null,"",null,null,10,null,1,null,"t",654440,"user",null,"f","2013-01-22T16:37:40-0500",null,"f"]}}

We’ll let the YAMLites and JSONistas square off outside, and take a look at a final example. This is an excerpt of a cached catalog search result in Evergreen:

memccat --servers 127.0.0.1:11211 open-ils.search_4b81a8a59544e8c7e9fdcda357d7b05f
{"0":{"summary":{"checked":630,"visible":"546","excluded":84,"deleted":0,"total":630,"complex_query":1},"results":[["74093"],["130197"], ...., ["880940"],["574457"]]}}
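Since the Evergreen values shown here are JSON, ordinary JSON tooling can pull numbers out of them. Here is a minimal sketch that assumes a cached search result shaped like the excerpt above (top-level keys, each holding a "summary" object and a "results" list); the function name is hypothetical, and real cache entries may carry fields that this ignores.

```python
import json


def search_summary(raw):
    """Summarize a cached search result as printed by memccat.

    Returns a dict keyed like the cached object, with integer counts.
    Note that some counts (e.g. "visible") may be stored as strings.
    """
    cached = json.loads(raw)
    summaries = {}
    for key, entry in cached.items():
        summary = entry["summary"]
        summaries[key] = {
            "total": int(summary["total"]),
            "visible": int(summary["visible"]),
            "excluded": int(summary["excluded"]),
            "result_count": len(entry["results"]),
        }
    return summaries
```

The int() calls are there because the excerpt stores "visible" as a quoted string while the other counts are bare numbers.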

There are other tools that let you manipulate the cache, including memcrm to remove keys and memccp to load key/value pairs into memcached.

For a complete list of the command-line tools provided by libMemcached, check out its documentation. To install them on Debian or Ubuntu, run apt-get install libmemcached-tools. Note that the Debian package renames the tools from ‘memdump’ to ‘memcdump’, ‘memcat’ to ‘memccat’, etc., to avoid a naming conflict with another package.

Bestiary of Monstrous Data #1: a trip through the character mangler

This is the first part in an occasional series on how good data can go bad.

Consider the following snippets of a MARC21 record for the Spanish edition of the fourth Harry Potter book.

00998nam  2200313 c 4500
...
240 10 $a Harry Potter and the goblet of fire $l Español
245 10 $a Harry Potter y el cáliz de fuego / $c J.K. Rowling ; [traducción, Adolfo Muñoz García y Nieves Martín Azofra]

The original record uses the Unicode character set with the UTF-8 character encoding. However, if you load this record into a modern ILS, e.g. Koha or Evergreen, the title is likely to end up displayed as:

Harry Potter y el c©Łliz de fuego / J.K. Rowling ; [traducci©đn, Adolfo Mu©łoz Garc©Ưa y Nieves Mart©Ưn Azofra]

Too much copyright! This isn’t an electronic course reserves blog!

What happened? Look at the 9th position of the leader (counting from zero), and you’ll see that it is blank. In MARC21, blank means that the record uses the MARC-8 character set, while ‘a’ means that it uses Unicode. Many, if not most, modern MARC tools will go by the Leader/09 to decide if a character conversion is needed. If the leader position is wrong, poor, defenseless diacritics will get mangled.

Why are there so many copyright signs in the mistreated title? As it happens, the UTF-8 representation of many common characters with Western European diacritics starts with byte 195 (or C3 in hexadecimal). What does C3 mean in the MARC-8 character encoding? You’ve guessed it: the copyright symbol.
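To make the mechanism concrete, here is a small Python sketch that reproduces the mangling. It is deliberately incomplete: the lookup table holds only the five MARC-8 byte values needed for this title, and the function names are invented for illustration. A real conversion would use a full MARC-8 table from a MARC library.

```python
# Partial MARC-8 lookup: just the high bytes that show up in this example.
MARC8_SAMPLE = {
    0xC3: "©",  # copyright mark
    0xA1: "Ł",
    0xB1: "ł",
    0xB3: "đ",
    0xAD: "Ư",
}


def leader_claims_unicode(leader):
    """True if Leader/09 is 'a' (Unicode); a blank means MARC-8."""
    return len(leader) > 9 and leader[9] == "a"


def misread_utf8_as_marc8(text):
    """Encode text as UTF-8, then decode each byte as if it were MARC-8."""
    mangled = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            mangled.append(chr(byte))  # ASCII range: taken as-is in this sketch
        else:
            mangled.append(MARC8_SAMPLE.get(byte, "?"))  # '?' = not in our tiny table
    return "".join(mangled)
```

Running misread_utf8_as_marc8 on “cáliz” gives “c©Łliz”: the á is stored as the two bytes C3 A1, and each byte is read as a separate MARC-8 character.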

There are a couple lessons to draw from this. First, using a good character encoding isn’t enough; you must also say what you’re up to. Second, if you look at enough bad data, you will start to recognize patterns on sight. If you deal with a lot of data, that “second sight” is an arcane but useful skill to develop.

CC-BY image of a woodcut of a viper courtesy of the Penn Provenance Project.