Discussions on Twitter today – see the timelines of @cm_harlow and @erinaleach for entry points – got me thinking.

In 1991, the Library of Congress had 745 staff in its Cataloging Directorate. By the end of FY 2004, the LC Bibliographic Access Divisions had between 506[1] and 561[2] staff.

What about now? As of 2014, the Acquisitions and Bibliographic Access unit has 238 staff[3].

While I’m sure one could quibble about the details (counting FTE vs. counting humans, accounting for the reorganizations, and so forth), the trend is clear: there has been a precipitous drop in the number of cataloging staff employed by the Library of Congress.

I’ll blithely ignore factors such as shifts in the political climate in the U.S. and how they affect civil service. Instead, I’ll focus on library technology, and spin three tales.

The tale of the library technologists

The decrease in the number of cataloging staff is one consequence of a triumph of library automation. The tools that we library technologists have written allow catalogers to work more efficiently. Sure, there are fewer of them, but that’s mostly been due to retirements. Not only that, the ones who are left are now free to work on more intellectually interesting tasks.

If we, the library technologists, can but slip the bonds of legacy cruft like the MARC record, we can make further gains in the expressiveness of our tools and the efficiencies they can achieve. We will be able to take advantage of metadata produced by other institutions and people for their own ends, enabling library metadata specialists to concern themselves with larger-scale issues.

Moreover, once our data is out there – who knows what others, including our patrons, can achieve with it?

This will of course be pretty disruptive, but as traditional library catalogers retire, we’ll reach buy-in. The library administrators have been pushing us to make more efficient systems, though we wish that they would invest more money in the systems departments.

We find that the catalogers are quite nice to work with one-on-one, but we don’t understand why they seem so attached to an ancient format that was only meant for record interchange.

The tale of the catalogers

The decrease in the number of cataloging staff reflects a success of library administration in their efforts to save money – but why is it always at our expense? We firmly believe that our work with the library catalog/metadata services counts as a public service, and we wish more of our public services colleagues knew how to use the catalog better.  We know for a fact that what doesn’t get catalogued may as well not exist in the library.

We also know that what gets catalogued badly or inconsistently can cause real problems for patrons trying to use the library’s collection.  We’ve seen what vendor cataloging can be like – and while sometimes it’s very good, often it’s terrible.

We are not just a cost center. We desperately want better tools, but we also don’t think that it’s possible to completely remove humans from the process of building and improving our metadata. 

We find that the library technologists are quite nice to work with one-on-one – but it is quite rare that we get to actually speak with a programmer.  We wish that the ILS vendors would listen to us more.

The tale of the library directors

The decrease in the number of cataloging staff at the Library of Congress is only partially relevant to the libraries we run, but hopefully somebody has figured out how to do cataloging more cheaply. We’re trying to make do with the money we’re allocated. Sometimes we’re fortunate enough to get a library funding initiative passed, but more often we’re trying to make do with less: sometimes to the point where flu season makes us super-nervous about our ability to keep all of the branches open.

We’re concerned not only with how much of our budgets are going into electronic resources, but with how nigh-impossible it is to predict increases in fees for ejournal subscriptions and ebook services.

We find that the catalogers and the library technologists are pleasant enough to talk to, but we’re not sure how well they see the big picture – and we dearly wish they could clearly articulate how yet another cataloging standard / yet another systems migration will make our budgets any more manageable.

Each of these tales is true. Each of these tales is a lie. Many other tales could be told. Fuzziness abounds.

However, there is one thing that seems clear: conversations about the future of library data and library systems involve people with radically different points of view. These differences do not mean that any of the people engaged in the conversations are villains, or do not care about library users, or are unwilling to learn new things.

The differences do mean that it can be all too easy for conversations to fall apart or get derailed.

We need to practice listening.

1. From testimony by the president of the Library of Congress Professional Guild to Congress on 6 March 2015.
2. From the BA FY 2004 report. This figure includes 32 staff from the Cataloging Distribution Service, which had been merged into BA and had not been part of the Cataloging Directorate.
3. From testimony by the president of the Library of Congress Professional Guild to Congress on 6 March 2015.

Just now I read a comment on a blog to the effect that although the commenter was mildly interested in something, she wasn’t going to Google it.

The something in question was a racist cultural practice that used to be common, but nowadays is condemned by most.

Why not search to learn more about it? She wasn’t particularly concerned that somebody would go through her web search history and think she endorsed the practice. Rather, she didn’t want to contribute even so much as a click to making the practice have higher visibility on the web.

Google searches ≠ endorsement, of course – but I find it interesting that nowadays some find it necessary to say so explicitly.

Is the COBOL programming language capable of processing MARC records?

A computer programmer in 2015 could be excused for thinking to herself, what kind of question is that!?! Surely it’s obvious that any programming language capable of receiving input can parse a simple, antique record format?

In 1968, it apparently wasn’t so obvious. I turned up an article by Henriette Avram and a colleague, MARC II and COBOL, that was evidently written in response to a review article by Hillis Griffin in which he stated:

Users will require programmers skilled in languages other than FORTRAN or COBOL to take advantage of MARC records.

Avram responded to Griffin’s concern in the most direct way possible: by describing COBOL programs developed by the Library of Congress to process MARC records and generate printed catalogs. Her article even includes source code, in case there were any remaining doubts!
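
As an aside, the core structure of a MARC record – leader, directory, then the variable fields – really is simple enough that a bare-bones reader fits in a few lines of just about any language. Here is a rough sketch in Python rather than Avram’s COBOL; real code would, of course, reach for a library such as pymarc instead.

FIELD_TERMINATOR = b'\x1e'

def parse_marc(record: bytes) -> dict:
    """Parse one MARC (ISO 2709) record into a {tag: [raw field bytes]} dict."""
    leader = record[:24]
    base_address = int(leader[12:17])        # offset where the variable fields begin
    directory = record[24:base_address - 1]  # 12-byte entries, minus the terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode()
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = record[base_address + start:base_address + start + length]
        fields.setdefault(tag, []).append(data.rstrip(FIELD_TERMINATOR))
    return fields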

I haven’t yet turned up any evidence that Henriette Avram and Grace Hopper ever met, but it was nice to find a close, albeit indirect connection between the two of them via COBOL.

Is the debate between Avram and Griffin in 1968 regarding COBOL and MARC anything more than a curiosity? I think it is — many of the discussions she participated in are reminiscent of debates that are taking place now. To be fair to Griffin, I don’t know enough about the computing environment of the late sixties to be able to say definitively that his statement was patently ill-informed at the time — but given that by 1962 IBM had announced that they were standardizing on COBOL, it seems hardly surprising that Avram and her group would be writing MARC processing code in COBOL on an IBM/360 by 1968. To me, the concerns that Griffin raised seem on par with objections to Library Linked Data that assume that each library catalog request would necessarily mean firing off a dozen requests to RDF providers — objections that have rejoinders that are obvious to programmers, but perhaps not so obvious to others.

Plus ça change, plus c’est la même chose?

The other day I made this blog, galencharlton.com/blog/, HTTPS-only.  In other words, if Eve wants to sniff what Bob is reading on my blog, she’ll need to do more than just capture packets between my blog and Bob’s computer to do so.

This is not bulletproof: perhaps Eve is in possession of truly spectacular computing capabilities or a breakthrough in cryptography and can break the ciphers. Perhaps she works for one of the sites that host external images, fonts, or analytics for my blog and has access to their server logs containing referrer header information.  Currently these sites are Flickr (images), Gravatar (more images), Google (fonts), and WordPress (site stats – I will be changing this soon, however). Or perhaps she’s installed a keylogger on Bob’s computer, in which case anything I do to protect Bob is moot.

Or perhaps I am Eve and I’ve set up a dastardly plan to entrap people by recording when they read about MARC records, then showing up at Linked Data conferences and disclosing that activity.  Or vice versa. (Note: I will not actually do this.)

So, yes – protecting the privacy of one’s website visitors is hard; often the best we can do is be better at it than we were yesterday.

To that end, here are some notes on how I made my blog require HTTPS.

Certificates

I got my SSL certificate from Gandi.net. Why them?  Their price was OK, I already register my domains through them, and I like their corporate philosophy: they support a number of free and open source software projects; they’re not annoying about up-selling, and they have never (to my knowledge) run sexist advertising, unlike some of their larger and better-known competitors. But there are, of course, plenty of options for getting SSL certificates, and once Let’s Encrypt is in production, it should be both cheaper and easier for me to replace the certs next year.

I have three subdomains of galencharlton.com that I wanted a certificate for, so I decided to get a multi-domain certificate.  I consulted this tutorial by rtCamp to generate the CSR.

After following the tutorial to create a modified version of openssl.conf specifying the subjectAltName values I needed, I generated a new private key and a certificate-signing request as follows:

openssl req -new -key galencharlton.com.key \
  -out galencharlton.com.csr \
  -config galencharlton.com.cnf \
  -sha256
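
For reference, the key part of such a config is the subjectAltName stanza, which looks something like this (a sketch only – the section names follow common openssl.cnf conventions, and the third hostname is just a placeholder, not my actual list of subdomains):

[ req ]
req_extensions = v3_req

[ v3_req ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = galencharlton.com
DNS.2 = www.galencharlton.com
# placeholder for the remaining subdomain
DNS.3 = sub.galencharlton.com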

The openssl command asked me a few questions, the most important of which was the value of the common name (CN) field. I used “galencharlton.com” for that, as that’s the primary domain that the certificate protects.

I then entered the text of the CSR into a form and paid the cost of the certificate.  Since I am a library techie, not a bank, I purchased a domain-validated certificate.  That means that all I had to do was prove to the certificate’s issuer that I had control of the three domains that the cert should cover.  That validation could have been done via email to an address at galencharlton.com or by inserting a special TXT record into the DNS zone file for galencharlton.com. I ended up choosing to go the route of placing a file on the web server whose contents and location were specified by the issuer; once they (or rather, their software) downloaded the test files, they had some assurance that I had control of the domain.

In due course, I got the certificate.  I put it and the intermediate cert specified by Gandi in the /etc/ssl/certs directory on my server and the private key in /etc/ssl/private/.

Operating System and Apache configuration

Various vulnerabilities in the OpenSSL library or in HTTPS itself have been identified and mitigated over the years: suffice it to say that it is a BEASTly CRIME to make a POODLE suffer a HeartBleed — or something like that.

To avoid the known problems, I wanted to ensure that I had a recent enough version of OpenSSL on the web server and had configured Apache to disable insecure protocols (e.g., SSLv3) and eschew bad ciphers.

The server in question is running Debian Squeeze LTS, but since OpenSSL 1.0.x is not currently packaged for that release, I ended up adding Wheezy to the APT repositories list and upgrading the openssl and apache2 packages.

For the latter, after some Googling I ended up adapting the recommended Apache SSL virtualhost configuration from this blog post by Tim Janik.  Here’s what I ended up with:

<VirtualHost _default_:443>
    ServerAdmin gmc@galencharlton.com
    DocumentRoot /var/www/galencharlton.com
    ServerName galencharlton.com
    ServerAlias www.galencharlton.com

    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/galencharlton.com.crt
    SSLCertificateChainFile /etc/ssl/certs/GandiStandardSSLCA2.pem
    SSLCertificateKeyFile /etc/ssl/private/galencharlton.com.key
    Header add Strict-Transport-Security "max-age=15552000"

    # No POODLE
    SSLProtocol all -SSLv2 -SSLv3 +TLSv1.1 +TLSv1.2
    SSLHonorCipherOrder on
    SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+AESGCM EECDH EDH+AESGCM EDH+aRSA HIGH !MEDIUM !LOW !aNULL !eNULL !LOW !RC4 !MD5 !EXP !PSK !SRP !DSS"

</VirtualHost>

I also wanted to make sure that folks coming in via old HTTP links would get permanently redirected to the HTTPS site:

<VirtualHost *:80>
    ServerName galencharlton.com
    Redirect 301 / https://galencharlton.com/
</VirtualHost>

<VirtualHost *:80>
    ServerName www.galencharlton.com
    Redirect 301 / https://www.galencharlton.com/
</VirtualHost>

Checking my work

I’m a big fan of the Qualys SSL Labs server test tool, which does a number of things to test how well a given website implements HTTPS:

  • Identifying issues with the certificate chain
  • Checking whether it supports vulnerable protocol versions such as SSLv3
  • Checking whether it supports – and prefers – the use of sufficiently strong ciphers
  • Checking whether it is vulnerable to common attacks

Suffice it to say that I required a couple iterations to get the Apache configuration just right.

WordPress

To be fully protected, all of the content embedded on a web page served via HTTPS must also be served via HTTPS.  In other words, image URLs in the posts themselves need to use HTTPS – the redirects in the Apache config are not enough.  Here is the sledgehammer I used to update image links in the blog posts:

-- make a backup copy of the posts table first
create table bkp_posts as select * from wp_posts;

begin;
-- rewrite http:// links pointing at this domain to https://
update wp_posts set post_content = replace(post_content, 'http://galen', 'https://galen') where post_content like '%http://galen%';
commit;

Whee!

I also needed to tweak a couple plugins to use HTTPS rather than HTTP to embed their icons or fetch JavaScript.

Finishing touches

In the course of testing, I discovered a couple more things to tweak:

  • The web server had been using Apache’s mod_php5filter – I no longer remember why – and that was causing some issues when attempting to load the WordPress dashboard.  Switching to mod_php5 resolved that.
  • My domain ownership proof on keybase.io failed after the switch to HTTPS.  I eventually tracked that down to the fact that keybase.io doesn’t have a bunch of intermediate certificates in its certificate store that many browsers do. I resolved this by adding a cross-signed intermediate certificate to the file referenced by SSLCertificateChainFile in the Apache config above.

My blog now has an A+ score from SSL Labs. Yay!  Of course, it’s important to remember that this is not a static state of affairs – another big OpenSSL or HTTPS protocol vulnerability could turn that grade to an F.  In other words, it’s a good idea to test one’s website periodically.

At the first face-to-face meeting of the LITA Patron Privacy Technologies Interest Group at Midwinter, one of the attendees mentioned that they had sent out an RFP last year for library databases. One of the questions on the RFP asked how user passwords were stored — and a number of vendors responded that their systems stored passwords in plain text.

Here’s what I tweeted about that, and here is Dorothea Salo’s reply:

https://twitter.com/LibSkrat/status/561605951656976384

This is a repeatable response, by the way — much like the way a hammer strike to the patellar ligament instigates a reflexive kick, mention of plain-text password storage will trigger an instinctual wail from programmers, sysadmins, and privacy and security geeks of all stripes.

Call it the Vanilla Password Reflex?

I’m not suggesting that you should whisper “plain text passwords” into the ear of your favorite system designer, but if you are the sort to indulge in low and base amusements…

A recent blog post by Eric Hellman discusses the problems with storing passwords in plain text in detail. The upshot is that it’s bad practice — if a system’s password list is somehow leaked, and if the passwords are stored in plain text, it’s trivially easy for a cracker to use those passwords to get into all sorts of mischief.

This matters, even “just” for library reference databases. If we take the right to reader privacy seriously, it has to extend to the databases offered by the library — particularly since many of them have features to store citations and search results in a user’s account.

As Eric mentions, the common solution is to use a one-way cryptographic hash function to transform the user’s password into a bunch of gobbledegook.

For example, “p@ssw05d” might be stored as the following hash:

d242b6313f32c8821bb75fb0660c3b354c487b36b648dde2f09123cdf44973fc

To make it more secure, I might add some random salt and end up with the following salted hash:

$2355445aber$76b62e9b096257ac4032250511057ac4d146146cdbfdd8dd90097ce4f170758a

To log in, the user has to prove that they know the password by supplying it, but rather than compare the password directly, the result of the one-way function applied to the password is compared with the stored hash.
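
To make the mechanics a bit more concrete, here is a minimal sketch in Python of the salt-and-hash approach, using PBKDF2 from the standard library. It is an illustration of the idea rather than a recipe – a production system should rely on a well-vetted authentication library and carefully chosen parameters.

import hashlib
import hmac
import os

def hash_password(password):
    """Return a (salt, digest) pair to store in place of the password itself."""
    salt = os.urandom(16)  # random salt, unique per user
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    return salt, digest

def check_password(password, salt, stored_digest):
    """Re-run the one-way function on the supplied password and compare."""
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    return hmac.compare_digest(candidate, stored_digest)

# At account creation, only the salt and the digest get stored.
salt, digest = hash_password("p@ssw05d")

# At login, the supplied password is hashed again and the results compared.
print(check_password("p@ssw05d", salt, digest))     # True
print(check_password("wrong-guess", salt, digest))  # False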

How is this more secure? If a hacker gets the list of password hashes, they won’t be able to deduce the passwords, assuming that the hash function is good enough. What counts as good enough? Well, relatively few programmers are experts in cryptography, but suffice it to say that there does exist a consensus on techniques for managing passwords and authentication.

The idea of one-way functions to encrypt passwords is not new; in fact, it dates back to the 1960s. Nowadays, any programmer who wants to be considered a professional really has no excuse for writing a system that stores passwords in plain text.

Back to the “Vanilla Password Reflex”. It is, of course, not actually a reflex in the sense of an instinctual response to a stimulus — programmers and the like get taught, one way or another, about why storing plain text passwords is a bad idea.

Where does this put the public services librarian? Particularly the one who has no particular reason to be well versed in security issues?

At one level, it just changes the script. If a system is well designed, then when a user asks what their password is, it should be impossible to answer the question. How to respond to a patron who informs you that they’ve forgotten their password? Let them know that you can change it for them. If they respond by wondering why you can’t just tell them, and if they’re actually interested in the answer, tell them about one-way functions — or just blame the computer; that’s fine too if time is short.

However, libraries and librarians can have a broader role in educating patrons about online security and privacy practices: leading by example. If we insist that the online services we recommend follow good security design; if we use HTTPS appropriately; if we show that we’re serious about protecting reader privacy, it can only buttress programming that the library may offer about (say) using password managers or avoiding phishing and other scams.

There’s also a direct practical benefit: human nature being what it is, many people use the same password for everything. If you crack an ILS’s password list, you’ve undoubtedly obtained a non-negligible set of people’s online banking passwords.

I’ll end this with a few questions. Many public services librarians have found themselves, like it or not, in the role of providing technical support for e-readers, smartphones, and laptops. How often does online security come up during such interactions? How often do patrons come to the library seeking help against the online bestiary of spammers, phishers, and worse? What works in discussing online security with patrons, who of course can be found at all levels of computer savvy? And what doesn’t?

I invite discussion — not just in the comments section, but also on the mailing list of the Patron Privacy IG.

Shortly after it came to light that Adobe Digital Editions was transmitting information about ebook reading activity in the clear, for anybody to snoop upon, I asked a loaded question: does ALA have a role in helping to verify that the software libraries use protect the privacy of readers?

As with any loaded question, I had an answer in mind: I do think that ALA and LITA, by virtue of their institutional heft and influence with librarians, can provide significant assistance in securing library software.

I waited a bit, wondering how the powers that be at ALA would respond. Then I remembered something: an institution like ALA is not, in fact, a faceless, inscrutable organism. Like Soylent Green, ALA is people!

Well, maybe not so much like Soylent Green. My point is that despite ALA’s reputation for being a heavily bureaucratic, procedure-bound organization, it does offer ways for members to take up an idea and run with it.

And that’s what I did — I floated a petition to form a new interest group within LITA, the Patron Privacy Technologies IG. Quite a few people signed it… and it now lives!

Here’s the charge of the IG:

The LITA Patron Privacy Technologies Interest Group will promote the design and implementation of library software and hardware that protects the privacy of library users and maximizes user ability to make informed decisions about the use of personally identifiable information by the library and its vendors.

Under this remit, activities of the Interest Group would include, but are not necessarily limited to:

  1. Publishing recommendations on data security practices for library software.
  2. Publishing tutorials on tools for libraries to use to check that library software is handling patron information responsibly.
  3. Organizing efforts to test commercially available software that handle patron information.
  4. Providing a conduit for responsible disclosure of defects in software that could lead to exposure of library patron information.
  5. Providing sample publicity materials for libraries to use with their patrons in explaining the library’s privacy practices.

I am fortunate to have two great co-chairs, Emily Morton-Owens of the Seattle Public Library and Matt Beckstrom of the Lewis and Clark Library, and I’m happy to announce that the IG’s first face-to-face meeting will be at ALA Midwinter 2015 — specifically tomorrow, at 8:30 a.m. Central Time in Ballroom 1 of the Sheraton in Chicago.

We have two great speakers lined up — Alison Macrina of the Library Freedom Project and Gary Price of INFODocket — and I’m very much looking forward to their talks.

But I’m also looking forward to the rest of the meeting: this is when the IG will, as a whole, decide how far to reach.  We have a lot of interest and the ability to do things that will teach library staff and our patrons how to better protect privacy, teach library programmers how to design and code for privacy, and verify that our tools match our ideals.

Despite the title of this blog post… it’s by no means my effort alone that will get us anywhere. Many people are already engaging in issues of privacy and technology in libraries, but I do hope that the IG will provide one more point of focus for our efforts.

I look forward to the conversation tomorrow.

I recently circulated a petition to start a new interest group within LITA, to be called the Patron Privacy Technologies IG.  I’ve submitted the formation petition to the LITA Council, and a vote on the petition is scheduled for early November.  I also held an organizational meeting with the co-chairs; I’m really looking forward to what we all can do to help improve how our tools protect patron privacy.

But enough about the IG, let’s talk about the petition! To be specific, let’s talk about when the signatures came in.

I’ve been on Twitter since March of 2009, but a few months ago I made the decision to become much more active there (you see, there was a dearth of cat pictures on Twitter, and I felt it my duty to help do something about it).  My first thought was to tweet the link to a Google Form I created for the petition. I did so at 7:20 a.m. Pacific Time on 15 October:

Since I wanted to gauge whether there was interest beyond just LITA members, I also posted about the petition on the ALA Think Tank Facebook group at 7:50 a.m. on the 15th.

By the following morning, I had 13 responses: 7 from LITA members, and 6 from non-LITA members. An interest group petition requires 10 signatures from LITA members, so at 8:15 on the 16th, I sent another tweet, which got retweeted by LITA:

By early afternoon, that had gotten me one more signature. I was feeling a bit impatient, so at 2:28 p.m. on the 16th, I sent a message to the LITA-L mailing list.

That opened the floodgates: 10 more signatures from LITA members arrived by the end of the day, and 10 more came in on the 17th. All told, a total of 42 responses to the form were submitted between the 15th and the 23rd.

The petition didn’t ask how the responder found it, but if I make the assumption that most respondents filled out the form shortly after they first heard about it, I arrive at my bit of anecdata: over half of the petition responses were inspired by my post to LITA-L, suggesting that the mailing list remains an effective way of getting the attention of many LITA members.

By the way, the petition form is still up for folks to use if they want to be automatically subscribed to the IG’s mailing list when it gets created.

Yesterday I did some testing of version 4.0.1 of Adobe Digital Editions and verified that it is now using HTTPS when sending ebook usage data to Adobe’s server adelogs.adobe.com.

Of course, because the HTTPS protocol encrypts the datastream to that server, I couldn’t immediately verify that ADE was sending only the information that the privacy statement says it is.

Emphasis is on the word “immediately”.  If you want to find out what a program is sending via HTTPS to a remote server, there are ways to get in the middle.  Here’s how I did this for ADE:

  1. I edited the hosts file to refer “adelogs.adobe.com” to the address of a server under my control.
  2. I used the CA.pl script from openssl to create a certificate authority of my very own, then generated an SSL certificate for “adelogs.adobe.com” signed by that CA.
  3. I put the certificate for my new certificate authority into the trusted root certificates store on my Windows 7 desktop.
  4. I put the certificate in place on my webserver and wrote a couple simple CGI scripts to emulate the ADE logging data collector and capture what got sent to them (a sketch of what such a capture script might look like follows this list).
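
I won’t reproduce the capture scripts here, but the shape was roughly as follows – a hypothetical Python stand-in with an illustrative log path, rather than the actual scripts I used:

#!/usr/bin/env python
# Hypothetical stand-in for a capture script: append whatever the client
# POSTs to a log file, then return an empty "200 OK" response.
import os
import sys

length = int(os.environ.get("CONTENT_LENGTH") or 0)
body = sys.stdin.read(length) if length else ""

with open("/var/log/ade-capture.log", "a") as log:  # illustrative path
    log.write(body + "\n")

sys.stdout.write("Status: 200 OK\r\n")
sys.stdout.write("Content-Type: text/plain\r\n\r\n")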

I then started up ADE and flipped through a few pages of an ebook purchased from Kobo.  Here’s an example of what is now getting sent by ADE (reformatted a bit for readability):

"id":"F5hxneFfnj/dhGfJONiBeibvHOIYliQzmtOVre5yctHeWpZOeOxlu9zMUD6C+ExnlZd136kM9heyYzzPt2wohHgaQRhSan/hTU+Pbvo7ot9vOHgW5zzGAa0zdMgpboxnhhDVsuRL+osGet6RJqzyaXnaJXo2FoFhRxdE0oAHYbxEX3YjoPTvW0lyD3GcF2X7x8KTlmh+YyY2wX5lozsi2pak15VjBRwl+o1lYQp7Z6nbRha7wsZKjq7v/ST49fJL",
"h":"4e79a72e31d24b34f637c1a616a3b128d65e0d26709eb7d3b6a89b99b333c96e",
"d":[  
   {  
      "d":"ikN/nu8S48WSvsMCQ5oCrK+I6WsYkrddl+zrqUFs4FSOPn+tI60Rg9ZkLbXaNzMoS9t6ACsQMovTwW5F5N8q31usPUo6ps9QPbWFaWFXaKQ6dpzGJGvONh9EyLlOsbJM"
   },
   {  
      "d":"KR0EGfUmFL+8gBIY9VlFchada3RWYIXZOe+DEhRGTPjEQUm7t3OrEzoR3KXNFux5jQ4mYzLdbfXfh29U4YL6sV4mC3AmpOJumSPJ/a6x8xA/2tozkYKNqQNnQ0ndA81yu6oKcOH9pG+LowYJ7oHRHePTEG8crR+4u+Q725nrDW/MXBVUt4B2rMSOvDimtxBzRcC59G+b3gh7S8PeA9DStE7TF53HWUInhEKf9KcvQ64="
   },
   {  
      "d":"4kVzRIC4i79hhyoug/vh8t9hnpzx5hXY/6g2w8XHD3Z1RaCXkRemsluATUorVmGS1VDUToDAvwrLzDVegeNmbKIU/wvuDEeoCpaHe+JOYD8HTPBKnnG2hfJAxaL30ON9saXxPkFQn5adm9HG3/XDnRWM3NUBLr0q6SR44bcxoYVUS2UWFtg5XmL8e0+CRYNMO2Jr8TDtaQFYZvD0vu9Tvia2D9xfZPmnNke8YRBtrL/Km/Gdah0BDGcuNjTkHgFNph3VGGJJy+n2VJruoyprBA0zSX2RMGqMfRAlWBjFvQNWaiIsRfSvjD78V7ofKpzavTdHvUa4+tcAj4YJJOXrZ2hQBLrOLf4lMa3N9AL0lTdpRSKwrLTZAFvGd8aQIxL/tPvMbTl3kFQiM45LzR1D7g=="
   },
   {  
      "d":"bSNT1fz4szRs/qbu0Oj45gaZAiX8K//kcKqHweUEjDbHdwPHQCNhy2oD7QLeFvYzPmcWneAElaCyXw+Lxxerht+reP3oExTkLNwcOQ2vGlBUHAwP5P7Te01UtQ4lY7Pz"
   }
]

In other words, it’s sending JSON containing… I’m not sure.

The values of the various keys in that structure are obviously Base 64-encoded, but when run through a decoder, the result is just binary data, presumably the result of another layer of encryption.
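
If you want to check that yourself, something along these lines will do it – feed one of the “d” values in on standard input and you get back high-entropy bytes rather than anything readable:

import base64
import sys

# Paste one of the base64 "d" values from the captured JSON on standard input.
blob = sys.stdin.read().strip()
decoded = base64.b64decode(blob)
print(len(decoded), "bytes")
print(decoded[:32])  # opaque binary, not readable text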

Thus, we haven’t actually gotten much further towards verifying that ADE is sending only the data they claim to.  That packet of data could be describing my progress reading that book purchased from Kobo… or it could be sending something else.

That extra layer of encryption might be done as protection against a real man-in-the-middle attack targeted at Adobe’s log server — or it might be obfuscating something else.

Either way, the result remains the same: reader privacy is not guaranteed. I think Adobe is now doing things a bit better than they were when they released ADE 4.0, but I could be wrong.

If we as library workers are serious about protecting patron privacy, I think we need more than assurances — we need to be able to verify things for ourselves. ADE necessarily remains in the “unverified” column for now.

A couple hours ago, I saw reports from Library Journal and The Digital Reader that Adobe has released version 4.0.1 of Adobe Digital Editions.  This was something I had been waiting for, given the revelation that ADE 4.0 had been sending ebook reading data in the clear.

ADE 4.0.1 comes with a special addendum to Adobe’s privacy statement that makes the following assertions:

  • It enumerates the types of information that it is collecting.
  • It states that information is sent via HTTPS, which means that it is encrypted.
  • It states that no information is sent to Adobe on ebooks that do not have DRM applied to them.
  • It states that it may collect and send information about ebooks that do have DRM.

It’s good to test such claims, so I upgraded to ADE 4.0.1 on my Windows 7 machine and my OS X laptop.

First, I did a quick check of strings in the ADE program itself — and found that it contained an instance of “https://adelogs.adobe.com/” rather than “http://adelogs.adobe.com/”.  That was a good indication that ADE 4.0.1 was in fact going to use HTTPS to send ebook reading data to that server.

Next, I fired up Wireshark and started ADE.  Each time it started, it contacted a server called adeactivate.adobe.com, presumably to verify that the DRM authorization was in good shape.  I then opened and flipped through several ebooks that were already present in the ADE library, including one DRM ebook I had checked out from my local library.

So far, it didn’t send anything to adelogs.adobe.com.  I then checked out another DRM ebook from the library (in this case, Seattle Public Library and its OverDrive subscription) and flipped through it.  As it happens, it still didn’t send anything to Adobe’s logging server.

Finally, I used ADE to fulfill a DRM ePub download from Kobo.  This time, after flipping through the book, it did send data to the logging server.  I can confirm that it was sent using HTTPS, meaning that the contents of the message were encrypted.

To sum up, ADE 4.0.1’s behavior is consistent with Adobe’s claims – the data is no longer sent in the clear, and a message was sent to the logging server only when I opened a new commercial DRM ePub.  However, without decrypting the contents of that message, I cannot verify that it contains only information about that ebook from Kobo.

But even then… why should Adobe be logging that information about the Kobo book? I’m not aware that Kobo is doing anything fancy that requires knowledge of how many pages I read from a book I purchased from them but did not open in the Kobo native app.  Have they actually asked Adobe to collect that information for them?

Another open question: why did opening the library ebook in ADE not trigger a message to the logging server?  Is it because the fulfillmentType specified in the .acsm file was “loan” rather than “buy”? More clarity on exactly when ADE sends reading progress to its logging server would be good.

Finally, if we take the privacy statement at its word, ADE is not implementing a page synchronization feature as some, including myself, have speculated – at least not yet.  Instead, Adobe is gathering this data to “share anonymous aggregated information with eBook providers to enable billing under the applicable pricing model”.  However, another sentence in the statement is… interesting:

While some publishers and distributors may charge libraries and resellers for 30 days from the date of the download, others may follow a metered pricing model and charge them for the actual time you read the eBook.

In other words, if any libraries are using an ebook lending service that does have such a metered pricing model, and if ADE is sending reading progress information to an Adobe server for such ebooks, that seems like a violation of reader privacy. Even though the data is now encrypted, if an Adobe ID is used to authorize ADE, Adobe itself has personally identifying information about the library patron and what they’re reading.

Adobe appears to have closed a hole – but there are still important questions left open. Librarians need to continue pushing on this.

Here is a partial list of various ways I can think of to expose information about library patrons and their search and reading history by use (and misuse) of software used or recommended by libraries.

  • Send a patron’s ebook reading history to a commercial website…
    • … in the clear, for anybody to intercept.
  • Send patron information to a third party…
    • … that does not have an adequate privacy policy.
    • … that has an adequate privacy policy but does not implement it well.
    • … that is sufficiently remote that libraries lack any leverage to punish it for egregious mishandling of patron data.
  • Use an unencrypted protocol to enable a third-party service provider to authenticate patrons or look them up…
    • … such as SIP2.
    • … such as SIP2, with the patron information response message configured to include full contact information for the patron.
    • … or many configurations of NCIP.
    • … or web services accessible over HTTP (as opposed to HTTPS).
  • Store patron PINs and passwords without encryption…
    • … or using weak hashing.
  • Store the patron’s Social Security Number in the ILS patron record.
  • Don’t require HTTPS for a patron to access her account with the library…
    • … or if you do, don’t keep up to date with the various SSL and TLS flaws announced over the years.
  • Make session cookies used by your ILS or discovery layer easy to snoop.
  • Use HTTP at all in your ILS or discovery layer – as oddly enough, many patrons will borrow the items that they search for.
  • Send an unencrypted email…
    • … containing a patron’s checkouts today (i.e., an email checkout receipt).
    • … reminding a patron of his overdue books – and listing them.
    • … listing the titles of the patron’s available hold requests.
  • Don’t encrypt connections between an ILS client program and its application server.
  • Don’t encrypt connections between an ILS application server and its database server.
  • Don’t notice that a rootkit has been running on your ILS server for the past six months.
  • Don’t notice that a keylogger has been running on one of your circulation PCs for the past three months.
  • Fail to keep up with installing operating system security patches.
  • Use the same password for the circulator account used by twenty circulation staff (and 50 former circulation staff) – and never change it.
  • Don’t encrypt your backups.
  • Don’t use the feature in your ILS to enable severing the link between the record of a past loan and the specific patron who took the item out…
    • … sever the links, but retain database backups for months or years.
  • Don’t give your patrons the ability to opt out of keeping track of their past loans.
  • Don’t give your patrons the ability to opt in to keeping track of their past loans.
  • Don’t give the patron any control or ability to completely sever the link between her record and her past circulation history whenever she chooses to.
  • When a patron calls up asking “what books do I have checked out?” … answer the question without verifying that the patron is actually who she says she is.
  • When a parent calls up asking “what books does my teenager have checked out?”… answer the question.
  • Set up your ILS to print out hold slips… that include the full name of the patron. For bonus points, do this while maintaining an open holds shelf.
  • Don’t shred any circulation receipts that patrons leave behind.
  • Don’t train your non-MLS staff on the importance of keeping patron information confidential.
  • Don’t give your MLS staff refreshers on professional ethics.
  • Don’t shut down library staff gossiping about a patron’s reading preferences.
  • Don’t immediately sack a library staff member caught misusing confidential patron information.
  • Have your ILS or discovery interface hosted by a service provider that makes one or more of the mistakes listed above.
  • Join a committee writing a technical standard for library software… and don’t insist that it take patron privacy into account.

Do you have any additions to the list? Please let me know!

Of course, I am not actually advocating disclosing confidential information. Stay tuned for a follow-up post.