At the first face-to-face meeting of the LITA Patron Privacy Technologies Interest Group at Midwinter, one of the attendees mentioned that they had sent out an RFP last year for library databases. One of the questions on the RFP asked how user passwords were stored — and a number of vendors responded that their systems stored passwords in plain text.

Here’s what I tweeted about that, and here is Dorothea Salo’s reply:

https://twitter.com/LibSkrat/status/561605951656976384

This is a repeatable response, by the way — much like the way a hammer strike to the patellar ligament instigates a reflexive kick, mention of plain-text password storage will trigger an instinctual wail from programmers, sysadmins, and privacy and security geeks of all stripes.

Call it the Vanilla Password Reflex?

I’m not suggesting that you should whisper “plain text passwords” into the ear of your favorite system designer, but if you are the sort to indulge in low and base amusements…

A recent blog post by Eric Hellman discusses the problems with storing passwords in plain text in detail. The upshot is that it’s bad practice — if a system’s password list is somehow leaked, and if the passwords are stored in plain text, it’s trivially easy for a cracker to use those passwords to get into all sorts of mischief.

This matters, even “just” for library reference databases. If we take the right to reader privacy seriously, it has to extend to the databases offered by the library — particularly since many of them have features to store citations and search results in a user’s account.

As Eric mentions, the common solution is to use a one-way cryptographic hash function to transform the user’s password into a bunch of gobbledegook.

For example, “p@ssw05d” might be stored as the following hash:

d242b6313f32c8821bb75fb0660c3b354c487b36b648dde2f09123cdf44973fc

To make it more secure, I might add some random salt and end up with the following salted hash:

$2355445aber$76b62e9b096257ac4032250511057ac4d146146cdbfdd8dd90097ce4f170758a

To log in, the user has to prove that they know the password by supplying it, but rather than compare the password directly, the result of the one-way function applied to the password is compared with the stored hash.
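As a minimal sketch of the salted-hash flow described above (not any vendor's actual implementation), the following uses only Python's standard library. The function names `hash_password` and `verify_password`, the choice of PBKDF2-HMAC-SHA256, and the iteration count are illustrative assumptions; a production system would more likely rely on a vetted library implementing bcrypt, scrypt, or Argon2.

```python
import hashlib
import hmac
import os

# Assumed work factor, for illustration only; tune it for your hardware.
ITERATIONS = 200_000

def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Re-run the one-way function on the supplied password and compare digests."""
    _, digest = hash_password(password, salt)
    # compare_digest avoids leaking where the two values first differ.
    return hmac.compare_digest(digest, expected_digest)

# Only the salt and the digest are stored; the password itself is never saved.
salt, stored = hash_password("p@ssw05d")
print(verify_password("p@ssw05d", salt, stored))  # True
print(verify_password("password", salt, stored))  # False
```

Because only the salt and digest are kept, there is nothing in the stored record that staff could read back to a patron; the only thing the system can do is check a guess.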

How is this more secure? If a hacker gets the list of password hashes, they won’t be able to deduce the passwords, assuming that the hash function is good enough. What counts as good enough? Well, relatively few programmers are experts in cryptography, but suffice it to say that there does exist a consensus on techniques for managing passwords and authentication.

The idea of using one-way functions to protect passwords is not new; in fact, it dates back to the 1960s. Nowadays, any programmer who wants to be considered a professional really has no excuse for writing a system that stores passwords in plain text.

Back to the “Vanilla Password Reflex”. It is, of course, not actually a reflex in the sense of an instinctual response to a stimulus — programmers and the like get taught, one way or another, about why storing plain text passwords is a bad idea.

Where does this put the public services librarian? Particularly the one who has no particular reason to be well versed in security issues?

At one level, it just changes the script. In a well-designed system, when a user asks what their password is, it should be impossible to answer the question. How to respond to a patron who informs you that they’ve forgotten their password? Let them know that you can change it for them. If they respond by wondering why you can’t just tell them, and they’re actually interested in the answer, tell them about one-way functions. Or just blame the computer; that’s fine too if time is short.

However, libraries and librarians can have a broader role in educating patrons about online security and privacy practices: leading by example. If we insist that the online services we recommend follow good security design; if we use HTTPS appropriately; if we show that we’re serious about protecting reader privacy, it can only buttress programming that the library may offer about (say) using password managers or avoiding phishing and other scams.

There’s also a direct practical benefit: human nature being what it is, many people use the same password for everything. If you crack an ILS’s password list, you’ve undoubtedly obtained a non-negligible set of people’s online banking passwords.

I’ll end this with a few questions. Many public services librarians have found themselves, like it or not, in the role of providing technical support for e-readers, smartphones, and laptops. How often does online security come up during such interactions? How often do patrons come to the library seeking help against the online bestiary of spammers, phishers, and worse? What works in discussing online security with patrons, who of course can be found at all levels of computer savvy? And what doesn’t?

I invite discussion — not just in the comments section, but also on the mailing list of the Patron Privacy IG.

Yesterday I did some testing of version 4.0.1 of Adobe Digital Editions and verified that it is now using HTTPS when sending ebook usage data to Adobe’s server adelogs.adobe.com.

Of course, because the HTTPS protocol encrypts the datastream to that server, I couldn’t immediately verify that ADE was sending only the information that the privacy statement says it is.

Emphasis is on the word “immediately”.  If you want to find out what a program is sending via HTTPS to a remote server, there are ways to get in the middle.  Here’s how I did this for ADE:

  1. I edited the hosts file to point “adelogs.adobe.com” at the address of a server under my control.
  2. I used the CA.pl script from openssl to create a certificate authority of my very own, then generated an SSL certificate for “adelogs.adobe.com” signed by that CA.
  3. I put the certificate for my new certificate authority into the trusted root certificates store on my Windows 7 desktop.
  4. I put the certificate in place on my webserver and wrote a couple simple CGI scripts to emulate the ADE logging data collector and capture what got sent to them.

I then started up ADE and flipped through a few pages of an ebook purchased from Kobo.  Here’s an example of what is now getting sent by ADE (reformatted a bit for readability):

In other words, it’s sending JSON containing… I’m not sure.

The values of the various keys in that structure are obviously Base64-encoded, but when run through a decoder, the result is just binary data, presumably the result of another layer of encryption.
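For anyone who wants to repeat that kind of check, here is a rough sketch in Python. The payload below is made up for illustration (the keys `readable` and `opaque` are hypothetical, not ADE's actual field names), and "does it decode as UTF-8" is only a crude heuristic for telling text from opaque binary data.

```python
import base64
import json

def describe_values(json_text):
    """Base64-decode every value in a JSON object and report whether it looks like text."""
    report = {}
    for key, value in json.loads(json_text).items():
        raw = base64.b64decode(value)
        try:
            raw.decode("utf-8")
            kind = "text"
        except UnicodeDecodeError:
            kind = "binary"
        report[key] = (kind, len(raw))
    return report

# Made-up payload: one value hides readable text, the other opaque bytes.
sample = json.dumps({
    "readable": base64.b64encode(b"page 12 of 300").decode("ascii"),
    "opaque": base64.b64encode(bytes([0x9F, 0xE2, 0x00, 0xFF, 0x81, 0x13])).decode("ascii"),
})
print(describe_values(sample))
# {'readable': ('text', 14), 'opaque': ('binary', 6)}
```

A result of "binary" does not prove that the data is encrypted; it could equally be compressed or in some packed format. It only shows that Base64 was not the last layer.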

Thus, we haven’t actually gotten much further towards verifying that ADE is sending only the data they claim to.  That packet of data could be describing my progress reading that book purchased from Kobo… or it could be sending something else.

That extra layer of encryption might be done as protection against a real man-in-the-middle attack targeted at Adobe’s log server — or it might be obfuscating something else.

Either way, the result remains the same: reader privacy is not guaranteed. I think Adobe is now doing things a bit better than they were when they released ADE 4.0, but I could be wrong.

If we as library workers are serious about protecting patron privacy, I think we need more than assurances: we need to be able to verify things for ourselves. ADE necessarily remains in the “unverified” column for now.

A couple hours ago, I saw reports from Library Journal and The Digital Reader that Adobe has released version 4.0.1 of Adobe Digital Editions.  This was something I had been waiting for, given the revelation that ADE 4.0 had been sending ebook reading data in the clear.

ADE 4.0.1 comes with a special addendum to Adobe’s privacy statement that makes the following assertions:

  • It enumerates the types of information that it is collecting.
  • It states that information is sent via HTTPS, which means that it is encrypted.
  • It states that no information is sent to Adobe on ebooks that do not have DRM applied to them.
  • It may collect and send information about ebooks that do have DRM.

It’s good to test such claims, so I upgraded to ADE 4.0.1 on my Windows 7 machine and my OS X laptop.

First, I did a quick check of strings in the ADE program itself — and found that it contained an instance of “https://adelogs.adobe.com/” rather than “http://adelogs.adobe.com/”.  That was a good indication that ADE 4.0.1 was in fact going to use HTTPS to send ebook reading data to that server.
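A quick check of this kind can be approximated in a few lines of Python, in the spirit of the Unix `strings` utility. This is only a sketch: the function name `find_urls` is hypothetical, and the bytes below are a made-up stand-in for an executable rather than anything taken from ADE itself.

```python
import re

# Runs of printable ASCII that start with http:// or https://
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")

def find_urls(data):
    """Return the sorted, de-duplicated URLs embedded in a blob of bytes."""
    return sorted({m.group().decode("ascii") for m in URL_PATTERN.finditer(data)})

# Made-up bytes standing in for a program file.
blob = b"\x00\x01junk\x00https://adelogs.adobe.com/\x00more\xff\xfe"
print(find_urls(blob))  # ['https://adelogs.adobe.com/']
```

To scan a real file, read it in binary mode and pass the contents along, for example `find_urls(open(path, "rb").read())`. URLs that a program assembles at run time or stores in another encoding would not show up this way.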

Next, I fired up Wireshark and started ADE.  Each time it started, it contacted a server called adeactivate.adobe.com, presumably to verify that the DRM authorization was in good shape.  I then opened and flipped through several ebooks that were already present in the ADE library, including one DRM ebook I had checked out from my local library.

So far, it didn’t send anything to adelogs.adobe.com.  I then checked out another DRM ebook from the library (in this case, Seattle Public Library and its OverDrive subscription) and flipped through it.  As it happens, it still didn’t send anything to Adobe’s logging server.

Finally, I used ADE to fulfill a DRM ePub download from Kobo.  This time, after flipping through the book, it did send data to the logging server.  I can confirm that it was sent using HTTPS, meaning that the contents of the message were encrypted.

To sum up, ADE 4.0.1’s behavior is consistent with Adobe’s claims – the data is no longer sent in the clear and a message was sent to the logging server only when I opened a new commercial DRM ePub.  However, without decrypting the contents of that message, I cannot verify that it contains only information about that ebook from Kobo.

But even then… why should Adobe be logging that information about the Kobo book? I’m not aware that Kobo is doing anything fancy that requires knowledge of how many pages I read from a book I purchased from them but did not open in the Kobo native app.  Have they actually asked Adobe to collect that information for them?

Another open question: why did opening the library ebook in ADE not trigger a message to the logging server?  Is it because the fulfillmentType specified in the .acsm file was “loan” rather than “buy”? More clarity on exactly when ADE sends reading progress to its logging server would be good.

Finally, if we take the privacy statement at its word, ADE is not implementing a page synchronization feature as some, including myself, have speculated – at least not yet.  Instead, Adobe is gathering this data to “share anonymous aggregated information with eBook providers to enable billing under the applicable pricing model”.  However, another sentence in the statement is… interesting:

While some publishers and distributors may charge libraries and resellers for 30 days from the date of the download, others may follow a metered pricing model and charge them for the actual time you read the eBook.

In other words, if any libraries are using an ebook lending service that does have such a metered pricing model, and if ADE is sending reading progress information to an Adobe server for such ebooks, that seems like a violation of reader privacy. Even though the data is now encrypted, if an Adobe ID is used to authorize ADE, Adobe itself has personally identifying information about the library patron and what they’re reading.

Adobe appears to have closed a hole – but there are still important questions left open. Librarians need to continue pushing on this.

Here is a partial list of various ways I can think of to expose information about library patrons and their search and reading history by use (and misuse) of software used or recommended by libraries.

  • Send a patron’s ebook reading history to a commercial website…
    • … in the clear, for anybody to intercept.
  • Send patron information to a third party…
    • … that does not have an adequate privacy policy.
    • … that has an adequate privacy policy but does not implement it well.
    • … that is sufficiently remote that libraries lack any leverage to punish it for egregious mishandling of patron data.
  • Use an unencrypted protocol to enable a third-party service provider to authenticate patrons or look them up…
    • … such as SIP2.
    • … such as SIP2, with the patron information response message configured to include full contact information for the patron.
    • … or many configurations of NCIP.
    • … or web services accessible over HTTP (as opposed to HTTPS).
  • Store patron PINs and passwords without encryption…
    • … or using weak hashing.
  • Store the patron’s Social Security Number in the ILS patron record.
  • Don’t require HTTPS for a patron to access her account with the library…
    • … or if you do, don’t keep up to date with the various SSL and TLS flaws announced over the years.
  • Make session cookies used by your ILS or discovery layer easy to snoop.
  • Use HTTP at all in your ILS or discovery layer – as oddly enough, many patrons will borrow the items that they search for.
  • Send an unencrypted email…
    • … containing a patron’s checkouts today (i.e., an email checkout receipt).
    • … reminding a patron of his overdue books – and listing them.
    • … listing the titles of the patron’s available hold requests.
  • Don’t encrypt connections between an ILS client program and its application server.
  • Don’t encrypt connections between an ILS application server and its database server.
  • Don’t notice that a rootkit has been running on your ILS server for the past six months.
  • Don’t notice that a keylogger has been running on one of your circulation PCs for the past three months.
  • Fail to keep up with installing operating system security patches.
  • Use the same password for the circulator account used by twenty circulation staff (and 50 former circulation staff) – and never change it.
  • Don’t encrypt your backups.
  • Don’t use the feature in your ILS to enable severing the link between the record of a past loan and the specific patron who took the item out…
    • … or sever the links, but retain database backups for months or years.
  • Don’t give your patrons the ability to opt out of keeping track of their past loans.
  • Don’t give your patrons the ability to opt in to keeping track of their past loans.
  • Don’t give the patron any control or ability to completely sever the link between her record and her past circulation history whenever she chooses to.
  • When a patron calls up asking “what books do I have checked out?” … answer the question without verifying that the patron is actually who she says she is.
  • When a parent calls up asking “what books does my teenager have checked out?”… answer the question.
  • Set up your ILS to print out hold slips… that include the full name of the patron. For bonus points, do this while maintaining an open holds shelf.
  • Don’t shred any circulation receipts that patrons leave behind.
  • Don’t train your non-MLS staff on the importance of keeping patron information confidential.
  • Don’t give your MLS staff refreshers on professional ethics.
  • Don’t shut down library staff gossiping about a patron’s reading preferences.
  • Don’t immediately sack a library staff member caught misusing confidential patron information.
  • Have your ILS or discovery interface hosted by a service provider that makes one or more of the mistakes listed above.
  • Join a committee writing a technical standard for library software… and don’t insist that it take patron privacy into account.

Do you have any additions to the list? Please let me know!

Of course, I am not actually advocating disclosing confidential information. Stay tuned for a follow-up post.

One can, in fact, have too many holidays.

Koha uses the DateTime::Set Perl module when (among other things) calculating the next day the library is open. Unfortunately, the more special holidays you have in a Koha database, the more time DateTime::Set takes to initialize itself — and the time appears to grow faster than linearly with the number of holidays.

Jonathan Druart partially addressed this with his patch for bug 11112 by implementing some lazy initialization and caching for Koha::Calendar, but that doesn’t make DateTime::Set‘s constructor itself any faster.

Today I happened to be working on a Koha database that turned out to have duplicate rows in the special_holidays table. In other words, for a given library, there might be four rows all expressing that the library is closed on 15 August 2014. That database contains hundreds of duplicates, which results in an extra 1-3 seconds per circulation operation.

The duplication is not apparent in the calendar editor, alas.

So here’s my first question: has anybody else seen this in their Koha database? The following query will turn up duplicates:
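A minimal sketch of such a duplicate check, written in Python against an in-memory SQLite database so that it is self-contained. The column names (`branchcode`, `day`, `month`, `year`, `isexception`) reflect my understanding of the special_holidays table and should be treated as assumptions; the sample rows are invented. Koha itself runs on MySQL, but the query uses only standard SQL.

```python
import sqlite3

# Group on the columns that define "the same holiday" and keep groups with extra copies.
DUPLICATE_QUERY = """
    SELECT branchcode, year, month, day, isexception, COUNT(*) AS copies
    FROM special_holidays
    GROUP BY branchcode, year, month, day, isexception
    HAVING COUNT(*) > 1
    ORDER BY branchcode, year, month, day
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE special_holidays (
        id INTEGER PRIMARY KEY,
        branchcode TEXT, day INTEGER, month INTEGER, year INTEGER,
        isexception INTEGER, title TEXT
    )
""")

# Invented data: four copies of 15 August 2014 for one library, plus two unique rows.
rows = [("CPL", 15, 8, 2014, 1, "Closed")] * 4 + [
    ("CPL", 25, 12, 2014, 1, "Closed"),
    ("MPL", 15, 8, 2014, 1, "Closed"),
]
conn.executemany(
    "INSERT INTO special_holidays (branchcode, day, month, year, isexception, title) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

duplicates = conn.execute(DUPLICATE_QUERY).fetchall()
print(duplicates)  # [('CPL', 2014, 8, 15, 1, 4)]
```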

And my second question: assuming that this somehow came about during normal operation of Koha (as opposed to duplicate rows getting directly loaded into the database), does anybody have any ideas how this happened?

This afternoon I’m sitting in the new bibliographic environment breakout session at Code4Lib BC. After taking a look at Mark Jordan’s easyLOD, I decided to play around with putting together a web service for Koha that emits RDF when fed a bib ID. Unlike Magnus Enger’s semantikoha prototype, which uses a Ruby library to convert MARC to RDF, I was trying for an approach that used only Perl (plus XS).

There were a lot of building blocks available. Putting them together turned out to be a tick more convoluted than I expected.

The Library of Congress has published an XSL stylesheet for converting MODS to RDF. Converting MARC(XML) to MODS is readily done using other stylesheets, also published by LC.

The path seemed clear for a quick-and-dirty prototype: copy svc/bib to opac/svc/bib, take out the bits for doing updates (we’re not quite ready to make cataloging that collaborative!), and write a few lines to apply two XSLT transformations.

The code was quickly written — but it didn’t work. XML::LibXSLT, which Koha uses to handle XSLT, complained about the modsrdf.xsl stylesheet. Too new! That stylesheet is written in XSLT 2.0, but libxslt, the C library that XML::LibXSLT is based on, only supports XSLT 1.

As it turns out, Perl modules that can handle XSLT 2.0 are rather thin on the ground. What I ended up doing was:

Installing XML::Saxon::XSLT2, which required…

Installing Saxon-HE, a Java XML and XSLT processor that supports XSLT 2.0, which required…

Installing Inline::Java, which required…

Installing a JDK (I happened to choose OpenJDK).

After all that (and a quick tweak to the modsrdf.xsl stylesheet), I ended up with the following code that did the trick:

This works… but is not satisfying. Making Koha require a JDK just for XSLT 2.0 support is a bit much, for one thing, and it would likely be rather slow if used in production. It’s a pity that there’s still no broad support for XSLT 2.0.

A dead end, most likely, but instructive nonetheless.

Peach Arch.  Photo by Daniel Means.  Licensed under CC-BY-SA and available at http://www.flickr.com/photos/supa_pedro/389603266.

There is nothing quite like the sense of sheer glee you get when you’re waiting at the border… and have been waiting at the border for a while… and then a new customs inspection lane is opened up. Zoom!

Marlene and I left Seattle this morning to go to the Code4Lib BC conference in Vancouver. Leaving in the morning meant that we missed the lightning talks, and arrived after the breakout sessions had started. Fortunately, folks were quick to welcome us, and I soon fell into the accessibility session.

Accessibility has been on my mind lately, but it’s an area where I’m starting mostly from ground zero. I knew that designing accessible systems is a Good Idea, I knew about the existence of some of the jargon and standards, and I knew that I didn’t know much else, certainly none of the specifics.

Cynthia Ng very kindly shared some pointers with me. For example, it is helpful to know that the Section 508 guidelines are essentially a subset of WCAG 1.0. This is exactly the sort of shortcut (through an apparently intimidating forest) that an expert can effortlessly give to a newbie, and having opportunities to learn from the experts is one of the reasons why I like going to conferences.

The accessibility breakout session charged itself with putting together a list of resources and best practices for accessibility and universal design. As I mentioned above, we arrived in the middle of the breakout session time, but a couple hours was more than enough time to get initial exposure to a lot of ideas and resources. It was exhilarating.

In no particular order, here is a list of various things that I’ll be following up on:

  • The Accessibility Project
  • Guerilla testing
  • The 5 second test
  • Swim lane diagrams
  • The Paciello Group Blog
  • Be careful about putting things in the right sidebar of a three-column layout — a lot of users have been trained by web advertising to completely ignore that region.  Similarly, a graphic with moving parts can get ignored if it looks too much like an ad.
  • The Code4Lib BC accessibility group’s notes
  • Having consistency of branding and look and feel can improve usability — but that can be a challenge when integrating a lot of separate systems (particularly if a library and a vendor have different ideas about whose branding should be foremost).
  • Integrating one’s content strategy with one’s accessibility strategy.  To paraphrase a point that Cynthia made a few times, putting out too much text is a problem for any user.
  • As with so much of software design, iterate early and often. The time to start thinking about accessibility is when you’re 20% of the way through a project, not when you’re 80% done.
  • Standards can help, but only up to a point.  A website could pass an automated WCAG compliance test with flying colors but not actually be usable by anyone.

And there’s another day of conference yet!  I’m quite happy we made the drive up.

Here’s a general question to the world: what reading material do you recommend for folks like me who want to learn more about writing accessible web software?

The next few days will be pretty intense for me, as I’ll be joining friends old and new for the hackfest of the 2013 Koha Conference. Hackfest will be an opportunity for folks to learn things, including how to work with Koha’s code, how and why librarians do the things they do — and how and why developers do the things they do. Stuff will be broken, stuff will be built up again, new features will be added, bugs will be fixed, and along the way, I will be cutting another alpha release of Koha 3.14.

Unfortunately, not everybody will be able to be sitting inside the conference room in Reno for the next three days. How can one participate from afar? Lots of ways:

  • Read the koha-devel mailing list and join the conversation. I will, at minimum, post a summary to koha-devel each day.
  • Follow the #kohacon13 hashtag on Twitter. Tweet to us using that hashtag if you have a question or request.
  • Look for blog posts from hackfest.
  • Join the #koha IRC channel.
  • Keep an eye on changes on the Koha wiki, particularly the roundtable notes and hackfest wishlist pages. If you’ve got additions, corrections, or clarifications to offer, please feel free to let us know or to edit the wiki pages directly.
  • Watch the Koha dashboard for patches to test and to see the progress made during hackfest.
  • Test and sign off on patches. BibLibre’s sandboxes make that super-duper simple.

Hackfest isn’t just for folks who know their way around the code — if you know about library practice, or have time to test things, or can write up documentation, you can help too!

We may also try setting up a Google Hangout. Because Google Hangouts limits the number of simultaneous users, if you’re interested in joining one, please let me know. If you have suggestions for other ways that folks can participate remotely, please let us know that as well.

Happy hacking!

Sometimes an idea that’s been staring you in the face has to jump up and down and wave its hands to get attention.

I was working with Katrin Fischer, Koha’s QA manager, who had just finished putting together a fresh Koha testing environment on her laptop so that she can do patch review during KohaCon’s hackfest. She mentioned wishing that something like MarcEdit were on her laptop so that she could quickly edit some records for testing. While MarcEdit could be run under WINE or Mono or in a Windows virtual machine, inspiration struck me: with a little help, vim makes a perfectly good basic MARC editor.

Here’s how — if you start with a file of MARC records, you can convert them to a text file using yaz-marcdump:

The resulting text file will look something like this:

To edit the records on the command line, you can use vim (or whatever your favorite text editor is). When you’re done, to convert them back to MARC, use

To avoid mangling special characters, it’s helpful to use UTF-8 as the character encoding. yaz-marcdump can also be used to convert a MARC file to UTF-8. For example, if the original MARC file uses the MARC-8 encoding, you could do:
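The round trip described above can be sketched as a small Python wrapper around yaz-marcdump. The file names are placeholders and the helper names are hypothetical; the flags used (`-i`, `-o`, `-f`, `-t`, `-l`) are standard yaz-marcdump options, but the particular combinations are my assumption about a reasonable invocation rather than the exact commands used for this post.

```python
import subprocess

def dump_command(marc_path):
    """Command to turn binary MARC records into yaz-marcdump's editable line format."""
    return ["yaz-marcdump", "-i", "marc", "-o", "line", marc_path]

def load_command(text_path):
    """Command to turn an edited line-format file back into binary MARC."""
    return ["yaz-marcdump", "-i", "line", "-o", "marc", text_path]

def to_utf8_command(marc_path):
    """Command to convert MARC-8 records to UTF-8; -l 9=97 sets leader position 9 to 'a'."""
    return ["yaz-marcdump", "-f", "MARC-8", "-t", "UTF-8", "-o", "marc", "-l", "9=97", marc_path]

def run_to_file(cmd, out_path):
    """Run a command and write its standard output to out_path (requires yaz-marcdump)."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)

# Print the commands rather than running them, so this works without YAZ installed.
for cmd in (dump_command("records.mrc"), load_command("records.txt"),
            to_utf8_command("records.mrc")):
    print(" ".join(cmd))
```

With YAZ installed, something like `run_to_file(dump_command("records.mrc"), "records.txt")` would produce the text file to open in vim.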

Not particularly profound, perhaps — and the title of this post is a bit tongue-in-cheek — but I know that this technique will save me a bit of time.

One number I quite like today is 99. That’s the difference between the count of explicitly enumerated tests in Koha’s master branch as of 19 May (1,837) and the count today (1,936) [1]. So far in the 3.14 cycle, eleven people have contributed patches that touch t/.

In particular, there’s been quite a bit of work on the database-dependent test suite that has increased both its coverage and its usability. Database-dependent test cases are useful for several reasons. First, a good bit of Koha’s code simply cannot be tested under realistic conditions if a Koha database isn’t available to talk to; while DBD::Mock can be used to mock query responses, it can be tedious to write the mocks. Second, test scripts that can use a database can readily exercise not just individual routines, but higher-level workflows. For example, it would be feasible to write a set of tests that creates a loan, renews it, simulates it becoming overdue, charges overdue fines, then returns the loan. In turn, being able to test larger sequences of actions can make it easier to avoid cases where a seemingly innocuous change to one core routine has an unanticipated effect elsewhere. This consideration particularly matters for Koha’s circulation code.

The automated buildbot has been running the DB-dependent tests for some time, but it’s historically been a dicier proposition for the average Koha hacker to run them on their own development databases. On the one hand, you probably don’t want to risk letting a test case mess up your database. On the other hand, some of the test cases make assumptions about the initial state of the database that may be unwarranted.

Although letting the buildbot do its thing is good, test cases are most useful if developers are able and willing to run them at any time. Worrying about damage to your development DB or having to figure out fiddly preconditions both decrease the probability that the tests will be run.

Recently, a simple “trick” has been adopted to deal with the first concern: make each DB-dependent test script operate in a transaction that gets rolled back. This is simple to set up:

The trick lies in setting AutoCommit on the database handle to 0. Setting RaiseError will cause the test script to abort if a fatal SQL error is raised. The $dbh->rollback() at the end is optional; if you let the script fall through to the end, or if the script terminates unexpectedly, the transaction will get rolled back regardless.
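The same idea can be illustrated outside of Perl. The sketch below uses Python's built-in sqlite3 module, not Koha's DBI setup, and a deliberately simplified, made-up version of the issuingrules table; it is meant only to show that everything a test does inside the transaction disappears on rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for a loan-rules table; the real Koha schema has more columns.
conn.execute("CREATE TABLE issuingrules (categorycode TEXT, issuelength INTEGER)")
conn.execute("INSERT INTO issuingrules VALUES ('ADULT', 21)")
conn.commit()  # this is the "development data" we want to keep

def run_test_in_transaction(conn):
    """Replace the rules with test data, check it, then roll everything back."""
    try:
        # sqlite3 opens a transaction implicitly before the first data change.
        conn.execute("DELETE FROM issuingrules")
        conn.execute("INSERT INTO issuingrules VALUES ('TEST', 7)")
        rows = conn.execute(
            "SELECT categorycode, issuelength FROM issuingrules").fetchall()
        assert rows == [("TEST", 7)]
        return rows
    finally:
        # Undo everything, whether the checks passed or raised.
        conn.rollback()

seen_during_test = run_test_in_transaction(conn)
after_test = conn.execute(
    "SELECT categorycode, issuelength FROM issuingrules").fetchall()
print(seen_during_test)  # [('TEST', 7)]
print(after_test)        # [('ADULT', 21)]
```

One caveat that applies to any database: statements that commit implicitly (in MySQL, most DDL such as CREATE TABLE or ALTER TABLE) end the transaction, so tests relying on rollback need to avoid them.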

Doing all of the tests inside of a transaction grants you … freedom. Testing circulation policies? You can empty out issuingrules, set up a set of test policies, run through the variations, then end the test script confident that your original loan rules will be back in place.

It also grants you ease. Although it’s a good idea for Koha to let you easily run the tests in a completely fresh database, test cases that can run in your main development database are even better.

This ties into the second concern, which is being addressed by an ongoing project which Jonathan Druart and others have been working on to make each test script create the test data it needs. For example, if a test script needs a patron record, it will add it rather than assume that the database contains one. The DB-dependent tests currently do make a broader assumption that some of the English-language sample data has been loaded (most notably the sample libraries), but I’m confident that that will be resolved by the time 3.14 is released.

I’m seeing a virtuous cycle starting to develop: the safer it gets for Koha devs to run the tests, the more that they will be run — and the more that will get written. In turn, the more test coverage we achieve, the more confidently we can do necessary refactoring. In addition, the more tests we have, the more documentation — executable documentation! — we’ll have of Koha’s internals.


[1] From the top of Koha’s source tree, egrep -ro 'tests => [0-9]+' t |awk '{print $3}'|paste -d+ -s |bc
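For readers without egrep, awk, paste, and bc at hand, a rough Python equivalent of that pipeline might look like the following. The function name `count_planned_tests` is hypothetical, and the demonstration runs against two throwaway files rather than Koha's actual t/ directory.

```python
import os
import re
import tempfile

# Same pattern as the egrep call: an explicit test plan such as "tests => 42".
PLAN = re.compile(r"tests => ([0-9]+)")

def count_planned_tests(root):
    """Sum every 'tests => N' found in any file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), errors="ignore") as fh:
                total += sum(int(n) for n in PLAN.findall(fh.read()))
    return total

# Demonstration on two made-up test scripts.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.t"), "w") as fh:
        fh.write("use Test::More tests => 5;\n")
    os.mkdir(os.path.join(root, "db_dependent"))
    with open(os.path.join(root, "db_dependent", "b.t"), "w") as fh:
        fh.write("use Test::More tests => 12;\n")
    demo_total = count_planned_tests(root)

print(demo_total)  # 17
```

From the top of a Koha checkout, `count_planned_tests("t")` would give the corresponding figure. Like the shell version, it counts only explicitly declared plans and misses scripts that use done_testing.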