One number I quite like today is 99. That’s the difference between the count of explicitly enumerated tests in Koha’s master branch as of 19 May (1,837) and the count today (1,936)[1]. So far in the 3.14 cycle, eleven people have contributed patches that touch t/.

In particular, there’s been quite a bit of work on the database-dependent test suite that has increased both its coverage and its usability. Database-dependent test cases are useful for several reasons. First, a good bit of Koha’s code simply cannot be tested under realistic conditions if a Koha database isn’t available to talk to; while DBD::Mock can be used to mock query responses, it can be tedious to write the mocks. Second, test scripts that can use a database can readily exercise not just individual routines, but higher-level workflows. For example, it would be feasible to write a set of tests that creates a loan, renews it, simulates it becoming overdue, charges overdue fines, then returns the loan. In turn, being able to test larger sequences of actions can make it easier to avoid cases where a seemingly innocuous change to one core routine has an unanticipated effect elsewhere. This consideration particularly matters for Koha’s circulation code.
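
To make this concrete, here is a rough sketch of what the middle of such a test script might look like. The circulation routines are real ones from C4::Circulation, but I’m hand-waving the setup of the test patron and item and the exact arguments:

use Modern::Perl;
use Test::More tests => 3;
use C4::Circulation;

# Assume $borrower (a patron hashref) and $item were created earlier in
# the script; creating test data on the fly is discussed below.
my $datedue = C4::Circulation::AddIssue( $borrower, $item->{barcode} );
ok( $datedue, 'loan was created' );

ok( C4::Circulation::AddRenewal( $borrower->{borrowernumber}, $item->{itemnumber} ),
    'loan was renewed' );

# ... simulate the loan becoming overdue and charge fines here ...

my ($returned) = C4::Circulation::AddReturn( $item->{barcode}, $borrower->{branchcode} );
ok( $returned, 'loan was returned' );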

The automated buildbot has been running the DB-dependent tests for some time, but it’s historically been a dicier proposition for the average Koha hacker to run them on their own development databases. For one thing, you probably don’t want to risk letting a test case mess up your database. For another, some of the test cases make assumptions about the initial state of the database that may be unwarranted.

Although letting the buildbot do its thing is good, test cases are most useful if developers are able and willing to run them at any time. Worrying about damage to your development DB and having to figure out fiddly preconditions both decrease the probability that the tests will be run.

Recently, a simple “trick” has been adopted to deal with the first concern: make each DB-dependent test script operate in a transaction that gets rolled back. This is simple to set up:

use Modern::Perl;
use C4::Context;
use Test::More tests => 5;

my $dbh = C4::Context->dbh;
# Start transaction
$dbh->{AutoCommit} = 0;
$dbh->{RaiseError} = 1;

# Testy-testy test test test

$dbh->rollback();

The trick lies in setting AutoCommit on the database handle to 0. Setting RaiseError will cause the test script to abort if a fatal SQL error is raised. The $dbh->rollback() at the end is optional; if you let the script fall through to the end, or if the script terminates unexpectedly, the transaction will get rolled back regardless.

Doing all of the tests inside of a transaction grants you … freedom. Testing circulation policies? You can empty out issuingrules, set up a set of test policies, run through the variations, then end the test script confident that your original loan rules will be back in place.
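
For instance, a policy test along these lines would leave no trace behind (issuingrules’ real column list is longer than what I show here, and the values are made up):

use Modern::Perl;
use Test::More tests => 1;
use C4::Context;

my $dbh = C4::Context->dbh;
$dbh->{AutoCommit} = 0;
$dbh->{RaiseError} = 1;

# Wipe the live loan rules -- harmless, since nothing gets committed.
$dbh->do(q{DELETE FROM issuingrules});

# Install a single catch-all test policy.
$dbh->do(q{
    INSERT INTO issuingrules (branchcode, categorycode, itemtype, issuelength)
    VALUES ('*', '*', '*', 21)
});

# ... exercise the circulation code against the known policy here ...
pass('circulation policy tests go here');

$dbh->rollback(); # and the original rules are back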

It also grants you ease. Although it’s a good idea for Koha to let you easily run the tests in a completely fresh database, test cases that can run in your main development database are even better.

This ties into the second concern, which is being addressed by an ongoing project by Jonathan Druart and others to make each test script create the test data it needs. For example, if a test script needs a patron record, it will add it rather than assume that the database contains one. The DB-dependent tests currently do make a broader assumption that some of the English-language sample data has been loaded (most notably the sample libraries), but I’m confident that that will be resolved by the time 3.14 is released.
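
For example, a test script that needs a patron might create one like this. AddMember is a real C4::Members routine, though I’m glossing over exactly which fields are required; the categorycode and branchcode values below come from the English sample data:

use Modern::Perl;
use Test::More tests => 1;
use C4::Members;

# Create a throwaway patron rather than assuming one already exists.
my $borrowernumber = C4::Members::AddMember(
    surname      => 'Test',
    firstname    => 'Patron',
    cardnumber   => 'TESTPATRON0001',
    categorycode => 'PT',
    branchcode   => 'CPL',
);
ok( $borrowernumber, 'created a test patron' );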

I’m seeing a virtuous cycle starting to develop: the safer it gets for Koha devs to run the tests, the more that they will be run — and the more that will get written. In turn, the more test coverage we achieve, the more confidently we can do necessary refactoring. In addition, the more tests we have, the more documentation — executable documentation! — we’ll have of Koha’s internals.


[1] From the top of Koha’s source tree, egrep -ro 'tests => [0-9]+' t |awk '{print $3}'|paste -d+ -s |bc

In the course of looking at the patch for Koha bug 9580 today, I ended up playing around with Coce.

Coce is a piece of software written by Frédéric Demians and licensed under the GPL that implements a cache for URLs of book cover images. It arose during a discussion of cover images on the Koha development mailing list.

The idea of Coce is this: rather than having the ILS either link directly to cover images by plugging the normalized ISBN into a URL pattern (as is done for Amazon, Baker & Taylor, and Syndetics) or call a web service to get the image’s URL (as is done for Google and Open Library), the ILS asks Coce, which queries the cover image providers and returns the image URLs. Furthermore, Coce caches the URLs, meaning that once it determines that the Open Library cover image for ISBN 9780563533191 can be found at http://covers.openlibrary.org/b/id/2520432-L.jpg, it need not ask again, at least for a while.

Having a cache like this provides some advantages:

  • Caching the result of web service calls reduces the load on the providers. That’s nice for the likes of the Open Library, and while even the most ambitious ILS is not likely to discomfit Amazon or Google, it doesn’t hurt to reduce the risk of getting rate-limited during summer reading.
  • Since Coce queries each provider for valid image URLs, users are less likely to see broken cover images in the catalog.
  • Since Coce can query multiple providers (it currently has support for the Open Library, Google Books, and Amazon’s Product Advertising API), more records can have cover images displayed as compared to using just one source.
  • It lends itself to using one Coce instance to service multiple Koha instances.

There are also some disadvantages:

  • It would be yet another service to maintain.
  • It would be another point of failure. On the other hand, it looks like it would be easy to set up multiple, load-balanced instances of Coce.
  • There is the possibility that image URLs might get cached for too long — although I don’t think any of the cover image services are in the habit of changing the static image URLs just for fun, they don’t necessarily guarantee that they will work forever.

I set up Coce on a Debian Wheezy VM. It was relatively simple to install; for posterity here is the procedure I used. First, I installed Redis, which Coce uses as its cache:

sudo apt-get install redis-server

Next, I installed Node.js by building a Debian package, then installing it:

sudo apt-get install python g++ make checkinstall
mkdir ~/src && cd $_
wget -N http://nodejs.org/dist/node-latest.tar.gz
tar xzvf node-latest.tar.gz && cd node-v*
./configure
checkinstall
sudo dpkg -i ./node_0.10.15-1_amd64.deb 

When I got to the point where checkinstall asked me to confirm the metadata for the package, I made sure to remove the “v” from the version number.

Next, I checked out Coce and installed the Node.js packages it needs:

cd ~
git clone https://github.com/fredericd/coce
cd coce
npm install express redis aws-lib util

I then copied "config.json-sample" to "config.json" and customized it. The only change I made, though, was to remove Amazon from the list of providers.

Finally, I started the service:

node webservice.js
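
A quick way to check that the service is answering is something along these lines; I’m going from memory on the /cover endpoint and its id and provider parameters, so double-check the request format against Coce’s README:

use Modern::Perl;
use LWP::Simple qw(get);
use JSON;

# Adjust the host and port to wherever webservice.js is listening.
my $coce = 'http://localhost:8080';
my $json = get("$coce/cover?id=9780563533191&provider=gb,ol");
die "no response from Coce\n" unless defined $json;
print JSON->new->pretty->encode( decode_json($json) );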

On my test Koha system, I installed the patch for bug 9580 and set the two system preferences it introduces to appropriate values to point to my Coce instance with the set of cover providers I wanted to use for the test.

The result? It worked: I did an OPAC search, and some of the titles that got displayed had their cover image provided by Google Books, while others were provided by the Open Library.

There are a few rough edges to work out. For example, the desired cover image size should probably be part of the client request to Coce, not part of Coce’s central configuration, and I suspect a bit more work is needed to get it to work properly if the OPAC is run under HTTPS. That said, this looks promising, and I enjoyed the chance to start playing a bit with Redis and Node.js.

Declaration of Independence.

Frederick Douglass: Oration, Delivered in Corinthian Hall, Rochester, by Frederick Douglass, July 5th, 1852.

Ulysses S. Grant: The Siege of Vicksburg (chapter 37 of the Personal Memoirs of U.S. Grant).

Treaty of General Relations Between the United States of America and the Republic of the Philippines. Signed at Manila, on 4 July 1946 [PDF].

Martin Luther King, Jr.: The American Dream (sermon given on 4 July 1965 at the Ebenezer Baptist Church, Atlanta, Georgia).

W. Caleb McDaniel: To Be Born On The Fourth Of July. Published by Ta-Nehisi Coates.

The current stable version of Perl is 5.18.0 … but for very good reasons, Koha doesn’t require the latest and greatest. For a very long time, Koha required a minimum version of 5.8.8. It wasn’t until October 2011, nearly four years after Perl 5.10.0 was released, that a patch was pushed setting 5.10.0 as Koha’s minimum required version.

Why so long? Since Perl is used by a ton of core system scripts and utilities, OS packagers are reluctant to push ahead too quickly. Debian oldstable has 5.10.1 and Debian stable ships with 5.14.2. Ubuntu tracks Debian in this respect. RHEL5 ships with Perl 5.8 and won’t hit EOL until 2017.

RHEL5 takes it too far in my opinion, unless you really need that degree of stasis — and I’m personally not convinced that staying that far behind the cutting edge necessarily gives one much more in the way of security. Then again, I don’t work for a bank. Suffice it to say, if you must run a recent version of Koha on RHEL5, you have your work cut out for you — compiling Perl from tarball or using something like Perlbrew to at least get 5.10 is a good idea. That will still leave you with rather a lot of modules to install from CPAN.

But since we, as Koha hackers, can count on having Perl 5.10, we can make the most of it. Here are a few constructs that were added in 5.10 that I find particularly useful for hacking on Koha.

Defined-OR operator

The defined-or operator, //, returns its left operand unless its value is undefined, in which case it returns the right operand. It lets you write:

my $a = get_a_possibly_undefined_value();
$a //= '';
print "Label: $a\n"; # won't throw a warning if the original value was undefined

or

my $a = get_a_possibly_undefined_value() // '';

rather than

my $a = get_a_possibly_undefined_value();
$a = '' unless defined($a);

or (horrors!)

my $a = get_a_possibly_undefined_value();
$a ||= ''; # if $a started out as 0...

Is this just syntactical sugar? Sure, but since Koha is a database-driven application whose schema has a lot of nullable columns, and since use of the Perl warnings pragma is mandated, it’s a handy one.

Named capture buffers

This lets you give a name to a regular expression capture group, allowing you to use the name rather than (say) $1, $2, etc. For example, you can write

if ($str =~ /tag="(?<tag>[0-9]{3})"/ ){
    print $+{tag}, "\n"; # %+ is a magic hash that contains the named capture groups' contents
}

rather than

if ($str =~ /tag="([0-9]{3})"/ ){
    print $1, "\n";
}

There’s a bit of a trade-off with this because the regular expression is now a little more difficult to read. However, since the code that uses the results can avoid declaring unnecessary temporary variables and is more robust in the face of changes to the number of capture groups in the regex, that trade-off can be worth it.

UNITCHECK blocks

The UNITCHECK block joins BEGIN, END, INIT and CHECK as ways of designating blocks of code to execute at specific points during the compilation and execution of a Perl program. UNITCHECK code is executed right after the compilation unit that defines it (for example, a module) has been compiled. In the patch I’m proposing for bug 10503, I found this handy to allow module initialization code to make use of functions defined in that same module.
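
Here’s a minimal sketch of the general pattern (not the actual bug 10503 patch) showing why UNITCHECK is handy for this:

package My::Demo;

use Modern::Perl;

our %defaults;

UNITCHECK {
    # This runs as soon as this file has finished compiling, so
    # _load_defaults() is already defined and callable here. A BEGIN
    # block at this spot would run before the sub below had been
    # compiled, and the call would blow up.
    _load_defaults();
}

sub _load_defaults {
    %defaults = ( max_items => 100 );
}

1;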

Warning, warning!

There are some constructs that were added in Perl 5.10, including the given/when keywords and the smart match operator ~~, that are deprecated as of Perl 5.18. Consequently, I will say no more about them other than this: don’t use them! Maybe the RHEL5 adherents have a point after all.

I yield!  Not only does Karen Schneider’s conference schedule beat up my schedule, it all but vaporizes it.

Nonetheless, I will be in Chicago, just not bouncing around quite so much.  When I’m not at MPOW’s booth in the exhibits hall, I’ll be attending:

  • LITA Open Source Systems Interest Group (Saturday, 29 June from 1-2:30 at the Palmer House Hilton, Price Room)
  • ALCTS/LITA MARC Formats Transition Interest Group (Saturday, 29 June from 3-4 at McCormick Place, room E351)
  • LITA/ALCTS Linked Library Data Interest Group (Sunday, 30 June from 8:30-10 at McCormick Place, room N129)
  • LITA Imagineering Interest Group (Sunday, 30 June from 10:30 to 11:30 at McCormick Place, room N134)

I’m outgoing chair of the Open Source Systems IG, so I should mention that during the meeting we will be having a discussion on the organizational structures behind open source software.  For example, user groups: Evergreen has a very organized user group, and so does Koha.  Is a foundation beneficial?  How is software development handled?  Is there a release manager?  How is project funding managed?

It promises to be an interesting discussion, so I invite any and all to attend.

The next few days promise to be busy.  Tuesday morning, my first stop is Sea-Tac to pick up a couple other conference attendees who are flying in from the East Coast.  After a stop for lunch, it’s a straight shot up I-5 and BC-99 to Vancouver.

Wednesday morning I’ll be bouncing around among the IG and committee meetings, and making a particular point of joining the Web Team and the Cataloging Working Group meetings.  I plan to spend most of the afternoon at the hackfest.

Thursday looks to be mostly sessions, but you may also find me distributing Evergreen t-shirts.

On Friday, I’ll be part of two presentations.  At noon, I’ll be talking about data quality and Evergreen, and at 2:30 I’ll be joining Rogan Hamby and Robin Johnson to talk about how networking affects Evergreen.

Friday morning I’ll be joining the other members of the Evergreen Oversight Board (and, I hope, other interested community members!) for our business meeting.  On Saturday morning the Oversight Board will give an update to the conference.

And other than that?  I’m looking forward to attending the keynotes and catching some sessions.  But most of all, I’m looking forward to seeing friends old and new.

Update 9 April 2013: The Evergreen Oversight Board meeting was rescheduled to 8 a.m. on Friday.

CC-BY-NC-ND image “Video Game Cats” by jenbooks on Flickr.

VideoGameCat is a website that “aims to be a resource for educators and librarians interested in using games in educational environments, whether in class or as part of a library collection.” It’s been quiet for a while, but it’s now back with a new design.  The review editor (and former classmate of mine), Shannon Farrell of the University of Minnesota, is looking for folks to contribute reviews and guest posts.

I’m a gamer and a library professional, and I also think that games have a place in collection development policies. There are of course plenty of places on the web to go to find video game reviews, but it’s not every review site that has the information needs of a collection development librarian in mind.

If you’ve been thinking of writing up the last game you’ve finished (or perhaps thrown against the wall!) or introduced in your library, please consider checking VideoGameCat’s page for new reviewers and submitting your review.

CC-BY photo “Dalek egg frontal view” by Nancy Sims.

It’s always neat to find out that somebody whose work you follow in one context has done something interesting in a completely different field.  Nancy Sims, who is an attorney and the Copyright Program Librarian at the University of Minnesota Libraries, writes the Copyright Librarian blog.  As I found out on Friday when I read her post On releasing an image to the wilds…, she also decorates eggs… elaborately.  Take a look at our friend on the right, and if you’re a Whovian like me, take another moment to squee.

She posted her photos of the Dalek egg on Flickr under a Creative Commons Attribution license, and the images went viral.  As you might expect, the photos get a spike in interest, including reblogs, around Easter every year. Sometimes the images get attributed properly, sometimes they do not. And sometimes Sims gets requests for permission to use the photos, including one, amusingly enough, from the BBC.

Of course, one of the points of the Creative Commons licenses is that you don’t have to ask permission to make use of CC-licensed content as long as you follow the terms of the particular variant that the creator applied. As Sims wrote:

[…] I hate it when people ask for permission to use things that already carry a CC-license sufficient to the purpose.

Further, in her response to the BBC’s request, she says:

I am a big fan of Creative Commons licenses, and would like to see them used more (when appropriate) by everyone!

This circles back to the title of my post: Sharing is for curmudgeons, too.  To be clear, I’m not using the word “curmudgeon” to refer to anybody in particular. You may be a curmudgeon most of the time, never, or only just in the morning before caffeine appears. You may simply want to get stuff done quickly, with the least amount of interaction required.

Free and open licenses are perfect for curmudgeons. Why? No need to ask for permission. Need an image of a pristine lawn for your website? You can grab a CC-BY image, put it up with attribution, and never ask permission. Need to tweak your webserver’s software? If it’s free software, you can just go get the source code and modify it — and never ask for permission.

It works in the other direction. If you’ve written a useful little utility for yourself, you can slap on a free software license and publish it on Gitorious… then forget about it. If somebody else finds it useful, great — and they don’t need to bother you about it if they want to fix a bug or enhance it!

Free software licenses help promote community, which is important for any project that is larger in scope than a single person. If we can all see and talk about the code, we can make it better faster. But free and open licenses also reduce friction, and that’s where the curmudgeons of the world come in — often great things come from somebody working alone in her figurative garage.

Curmudgeons of the world — unite! Or not, it’s your choice.

Today I discovered two things that have been around for a while but which are new to me.

Every now and again I’ve lent my computers’ spare cycles to projects like the Great Internet Mersenne Prime Search and SETI@home, both of which have been crowdsourcing scientific computing since long before the term “crowdsourcing” became popular.  One of my discoveries today was a project that’s directly related to my professional interests: distributed archiving of websites that are about to go dark.

It all started when this came across my Twitter feed:

@textfiles Yes, you read right, Yahoo! is completely rate-limiting/temp-banning us from making copies of this data they're deleting. ZERG RUSH NEEDED

A Zerg rush on Yahoo?  Say what?  I had visited textfiles.com, an archive of hacker lore, in the past and knew that Jason Scott did interesting things, but had no idea what he was up to now.

It didn’t take much poking around to figure out what’s up.  Yahoo has announced that their Message Boards service is being discontinued at the end of the month.  Of course, there’s no lack of options for places on the web for folks to talk, although I wouldn’t be surprised to hear that there are a few niche communities on the boards that will have to scramble to find a new home.  What can’t be replaced, of course, are the past discussions — and those were made by the users of the service, not by Yahoo.  So far, it doesn’t sound like Yahoo is interested in providing an archive.

That’s where the Archive Team comes in.  From their homepage:

Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. Since 2009 this variant force of nature has caught wind of shutdowns, shutoffs, mergers, and plain old deletions – and done our best to save the history before it’s lost forever.

Sometimes they’ve been able to save the content of a service that’s going dark just by asking for a copy.  Often, however, it has been necessary to crawl the website before the clock runs out.

That’s where the crowdsourcing comes in: by downloading a virtual machine, you too can have your computer become a “Warrior” and use some of its bandwidth to crawl dying websites, then send the data back to the Archive Team’s archive.  From there, the data gets collocated and sent to a variety of places, including the Internet Archive.

This is not necessarily polite archiving.  In the name of getting as complete a capture as possible, the archiving appliance intentionally ignores the robots exclusion protocol that normal web crawlers should follow.  Furthermore, having a crowd of Warriors increases the chance that the archiving will progress even in the face of rate-limiting, as Yahoo is currently doing to individual computers that download too quickly.

Does this sound messy?  Sure.  Would a cautious institution want to think twice before running a Warrior? Perhaps — the cause is worthy, but the potential for liability is uncertain if a website operator decided to call an archiving effort a distributed denial-of-service attack.

Is it necessary?  I believe that it is, so I’m running a Warrior.

The virtual machine, which runs on top of VirtualBox or the like, is dead simple to use, and you can control which projects the Warrior will participate in.  Besides the Yahoo Message Boards, the Archive Team is also currently archiving the blogging service Posterous, which is due to go dark at the end of April.

Since the Yahoo Message Boards are going dark less than nine days from now, I encourage folks to consider pitching in now.  Think of it as the WOZ corollary to LOCKSS: Waves of Zergs create the archive.  Then we can have the stuff for Lots of Copies Keep Stuff Safe.

The other discovery I made today?  Just Google for “zerg rush” and wait a moment.

This is the second part in an occasional series on how good data can go bad.

One aspect of the MARC standard that sometimes is forgotten is that it was meant to be a cataloging communications format. One could design an ILS that doesn’t use anything resembling MARC 21 to store or express bibliographic data, but as long as its internal structure is sufficiently expressive to keep track of the distinctions called for by AACR2, in principle it could relegate MARC handling strictly to import and export functionality. By doing so, it would follow a conception of MARC as a lingua franca for bibliographic software.

In practice, of course, MARC isn’t just a common language for machines — it’s also part of a common language for catalogers.  If you say “222” or “245” or “780” to one, you’ve communicated a reasonably precise (in the context of AACR2) identification of a metadata attribute.  Sure, it’s arcane, but then again so is most professional jargon to non-practitioners.  MARC has also become the basis of record storage and editing in most ILSs, to the point where the act of cataloging is sometimes conflated with the act of creating and editing MARC records.

But MARC’s origins as a communications format can sometimes conflict with its ad hoc role as a storage format.  Consider this record:

00528dam  22001577u 4500
001 123
100 1  $a Strang, Elizabeth Leonard.
245 10 $a Lectures on landscape and gardening design / $c by Elizabeth Leonard Strang.

A brief bibliographic record, right?  Look at the Leader/05, which stores the record status.  The value ‘d’ means that the record is deleted; other values for that position include ‘n’ for new and ‘c’ for corrected.
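
For the Perl-inclined, the status byte is easy to inspect with MARC::Record; here’s a quick illustration, assuming $record already holds the record above:

use Modern::Perl;
use MARC::Record;

# Leader/05 is zero-based offset 5 -- the sixth character of the leader.
my $status = substr( $record->leader(), 5, 1 );
say 'this record claims to be deleted' if $status eq 'd';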

But unlike, say, the 245, the Leader/05 isn’t making an assertion about a bibliographic entity.  It’s making an assertion about the metadata record itself, and one that requires more context to make sense.  There can’t be a globally valid assertion that a record is deleted; my public library may have deaccessioned Lectures on landscape and gardening design, but your horticultural library may keep that title indefinitely.

Consequently, the Leader/05 is often ignored when creating or modifying records in an ILS.  For example, if a bib record is present in an Evergreen or Koha database, setting its Leader/05 to ‘d’ does not affect its indexing or display.

However, such records can become undead — not in the context of the ILS, but in the context of exporting them for loading into a discovery layer or a union catalog. Some discovery layers do look at the Leader/05.  If an incoming record is marked as deleted, that is taken as a signal to remove the matching record from the discovery layer’s own indexes.  If there is no matching record, the discovery layer could reasonably ignore an incoming “deleted” record — and I know of at least one that does exactly that.

The result? A record that appears to be perfectly good in the ILS doesn’t show up in the discovery layer.

Context matters.

I’ll finish with a couple SQL queries for finding such undead records, one for Evergreen:

SELECT record
FROM metabib.full_rec mfr
JOIN biblio.record_entry bre ON (bre.id = mfr.record)
WHERE tag = 'LDR'
AND SUBSTRING(value, 6, 1) = 'd'
AND NOT bre.deleted;

and one for Koha:

SELECT biblionumber
FROM biblioitems 
WHERE ExtractValue(marcxml, 'substring(//leader, 6, 1)') = 'd';


CC-BY image of a woodcut of a viper courtesy of the Penn Provenance Project.