Category Archives: Code4Lib

Scaling the annual Code4Lib conference

One of the beautiful things about Code4Lib qua banner is that it can be easily taken up by anyway without asking permission.

If I wanted to, I could have lunch with a colleague, talk about Evergreen, and call it a Code4Lib meetup, and nobody could gainsay me — particularly if I wrote up a summary of what we talked about.

Three folks in a coffeehouse spending an afternoon hacking together a connection between digital repository Foo and automatic image metadata extractor Bar, then tossing something up on the Code4Lib Wiki? Easy-peasy.

Ten people for dinner and plotting to take over the world replace MARC once and for all? Probably should make a reservation at the restaurant.

Afternoon workshop for 20 in your metro area? Well, most libraries have meeting rooms, integral classrooms, or computer labs— and directors willing to let them be used for the occasional professional development activity.

Day and a half conference for 60 from your state, province, or region? That’s probably a bit more than you can pull off single-handedly, and you may well simply not have the space for it if you work for a small public library. You at least need to think about how folks will get meals and find overnight accommodations.

The big one? The one that nowadays attracts over four hundred people from across the U.S. and Canada, with a good sprinkling of folks from outside North America — and expects that for a good chunk of the time, they’ll all be sitting in the same room? And that also expects that at least half of them will spend a day scattered across ten or twenty rooms for pre-conference workshops? That folks unable to be there in person expect to live-stream? That tries in more and more ways tries to lower barriers to attending it?

Different kettle of fish entirely.

The 2017 conference incurred a tick under $240,000 in expenses. The 2016 conference: a bit over $207,000. This year? At the moment, projected expenses are in the neighborhood of $260,000.

What is this going towards? Convention center or hotel conference space rental and catering (which typically need to be negotiated together, as guaranteeing enough catering revenue and/or hotel nights often translates into “free” room rental). A/V services, including projectors, sound systems, and microphones. Catering and space rental for the reception. For the past few years, the services of a professional event management firm — even with 50+ people volunteering for Code4Lib conference committees, we need the professionals as well. Diversity scholarships, including travel expenses, forgone registration fees, and hotel nights. T-shirts. Gratuities. Live transcription services.

How is this all getting paid for? Last year, 49% of the income came from conference and pre-conference registrations, 31% from sponsorships and exhibitor tables, 5% from donations and sponsorships for scholarships, and 3% from hotel rebates and room credits.

The other 12%? That came from the organizers of the 2016 conference in Philadelphia, who passed along a bit under $33,000 to the 2017 LPC. The 2017 conference in turn was able to pass along a bit over $25,000 to the organizers of the forthcoming 2018 conference.

In other words, the 2017 conference effectively operated at a loss of a bit under $8,000, although fortunately there was enough of a cushion that from UCLA’s perspective, the whole thing was a wash — if you ignore some things. Things like the time that UCLA staff who were members of the 2017 local planning committee spent on the whole effort — and time spent by administrative staff in UCLA’s business office.

What are their names? I have no clue.

But something I can say much more confidently: every member of the 2017 LPC and budget committees lost sleep pondering what might happen if things went wrong. If we didn’t get enough sponsorships. If members of the community would balk at the registration fee — or simply be unable to afford it — and we couldn’t meet our hotel room night commitments.

I can also say, without direct knowledge this time, but equally confidently, that members of the 2016 organizers lost sleep. And 2015. And so on down the line.

While to my knowledge no Code4Lib member has ever been personally liable for the hotel contracts, I leave it to folks to consider the reputational consequence of telling their employer, were a conference to fail, that that institution is on the hook for potentially tens of thousands of dollars.

Of course, somebody could justly respond by citing an ancient joke. You know, the one that begins like this: “Doctor, it hurts when I do this!”.

And that’s a fair point. It is both a strength and weakness of Code4Lib that it imposes no requirement that anybody do anything in particular. We don’t have to have a big annual conference; a lot of good can be done under the Code4Lib banner via electronic communications and in-person meetups small enough that it’s of little consequence if nobody happens to show up.

But I also remember the days when the Code4Lib conference would open registration, then close it a couple hours later because capacity has been reached. Based on the attendance trends, we know that we can reasonably count on at least 400 people being willing to travel to attend the annual conference.  If a future LPC manages to make the cost of attending the conference signficantly lower, I could easily see 500 or 600 people showing up (although I would then wonder if we might hit some limits on how large a single-track conference can be and still remain relevant for all of the attendees),

I think there is value in trying to put on a conference that brings in as many practitioners (and yes, managers) in the GLAM technology space together in person as can come while also supporting online participation — but puts control of the program in the hands of the attendees via a process that both honors democracy and invites diversity of background and viewpoint.

Maybe you agree with that—and maybe you don’t. But even if you don’t agree, please do acknowledge the astonishing generosity of the people and institutions that have put their money and reputation on the line to host the annual conference over the years.

Regardless, if Code4Lib is to continue to hold a large annual conference while not being completely dependent on the good graces of a  small set of libraries that are in a position to assume $250,000+ liabilities, the status quo is not sustainable.

That brings me to the Fiscal Continuity Interest Group, which I have helped lead. If you care about the big annual conference, please read the report (and if you’re pressed for time, start with the summary of options), then vote. You have until 23:59 ET on Friday, November 3 to respond to the survey.

The survey offers the following options:

  • maintain the status quo, meaning that each potential conference host is ultimately responsible for deciding how the liability of holding the conference should be managed
  • set up a non-profit organization
  • pick among four institutions that have generously offered to consider acting as ongoing fiscal sponsors for the annual conference

I believe that moving away from the status quo will help ensure that the big annual Code4Lib conference can keep happening while broadening the number of institutions that would be able to physically host it. Setting up some kind of ongoing fiscal existence for Code4Lib may also solve some problems for the folks who have been running the Code4Lib Journal.

I also believe that continuing with the status quo necessarily means that the Code4Lib community must rethink the annual conference: whether to keep having it at all; to accept the fact that only a few institutions are nowadays capable of hosting it at the scale we’re accustomed to; and to accept that if an institution is nonetheless willing to host it, that we should scale back expectations that the community is entitled to direct the shape of the conference once a host has been selected.

In other words, it boils down to deciding how we wish to govern ourselves. This doesn’t mean that Code4Lib needs to embrace bureaucracy… but we must either accept some ongoing structure or scale back.

Choose wisely.

On triviality

Just now I read a blog post by a programmer whose premise was that it would be “almost trivial” to do something — and proceeded to roll my eyes.

However, it then occurred to me to interrogate my reaction a little. Why u so cranky, Galen?

On the one hand, the technical task in question, while certainly not trivial in the sense that it would take an inexperienced programmer just a couple minutes to come up with a solution, is in fact straightforward enough. Writing new software to do the task would require no complex math — or even any math beyond arithmetic. It could reasonably be done in a variety of commonly known languages, and there are several open source projects in the problem space that could be used to either build on or crib from. There are quite a few potential users of the new software, many of who could contribute code and testing, and the use cases are generally well understood.

On the other hand (and one of the reasons why I rolled my eyes), the relative ease of writing the software masks, if not the complexity of implementing it, the effort that would be required to do so. The problem domain would not be well served by a thrown-over-the-wall solution; it would take continual work to ensure that configurations would continue to work and that (more importantly) the software would be as invisible as possible to end users. Sure, the problem domain is in crying need of a competitor to the current bad-but-good-enough tool, but new software is only the beginning.

Why? Some things that are not trivial, even if the coding is:

  • Documentation, particularly on how to switch from BadButGoodEnough.
  • Community-building, with all the emotional labor entailed therein.

On the gripping hand: I nonetheless can’t completely dismiss appeals to triviality. Yes, calling something trivial can overlook the non-coding working required to make good software actually succeed. It can sometimes hide a lack of understanding of the problem domain; it can also set the coder against the user when the user points out complications that would interfere with ease of coding. The phrase “trivial problem” can also be a great way to ratchet up folks’ imposter syndrome.

But, perhaps, it can also encourage somebody to take up the work: if a problem is trivial, maybe I can tackle it. Maybe you can too. Maybe coming up with an alternative to BadButGoodEnoughProgram is within reach.

How can we better talk about such problems — to encourage folks to both acknowledge that often the code is only the beginning, while not loading folks down with so many caveats and considerations that only the more privileged among us feel empowered to make the attempt to tackle the problem?

Visualizing the global distribution of Evergreen installations from tarballs

In August I made a map of Koha installations based on geolocation of the IP addresses that retrieved the Koha Debian package. Here’s an equivalent map for Evergreen:

Downloads of Evergreen tarballs in past 52 weeks

Click to get larger image

As with the Koha map, this is based on the last 52 weeks of Apache logs as of the date of this post. I included only complete downloads of Evergreen ILS tarballs and excluded downloads done by web crawlers.  A total of 1,317 downloads from 838 distinct IP addresses met these criteria.

The interactive version can be found on Plotly.

Visualizing the global distribution of Koha installations from Debian packages

A picture is worth a thousand words:

Downloads of Koha Debian packages in past 52 weeks

Click to get larger image.

This represents the approximate geographic distribution of downloads of the Koha Debian packages over the past year. Data was taken from the Apache logs from debian.koha-community.org, which MPOW hosts. I counted only completed downloads of the koha-common package, of which there were over 25,000.

Making the map turned out to be an opportunity for me to learn some Python. I first adapted a Python script I found on Stack Overflow to query freegeoip.net and get the latitude and longitude corresponding to each of the 9,432 distinct IP addresses that had downloaded the package.

I then fed the results to OpenHeatMap. While that service is easy to use and is written with GPL3 code, I didn’t quite like the fact that the result is delivered via an Adobe Flash embed.  Consequently, I turned my attention to Plotly, and after some work, was able to write a Python script that does the following:

  1. Fetch the CSV file containing the coordinates and number of downloads.
  2. Exclude as outliers rows where a given IP address made more than 100 downloads of the package during the past year — there were seven of these.
  3. Truncate the latitude and longitude to one decimal place — we need not pester corn farmers in Kansas for bugfixes.
  4. Submit the dataset to Plotly with which to generate a bubble map.

Here’s the code:

An interactive version of the bubble map is also available on Plotly.

What makes the annual Code4Lib conference special?

There’s now a group of people taking a look at whether and how to set up some sort of ongoing fiscal entity for the annual Code4Lib conference.  Of course, one question that comes to mind is why go to the effort? What makes the annual Code4Lib conference so special?

There are lot of narratives out there about how the Code4Lib conference and the general Code4Lib community has helped people, but for this post I want to focus on the conference itself. What does the conference do that is unique or uncommon? Is there anything that it does that would be hard to replicate under another banner? Or to put it another way, what makes Code4Lib a good bet for a potential fiscal host — or something worth going to the effort of forming a new non-profit organization?

A few things that stand out to me as distinctive practices:

  • The majority of presentations are directly voted upon by the people who plan to attend (or who are at least invested enough in Code4Lib as a concept to go to the trouble of voting).
  • Similarly, keynote speakers are nominated and voted upon by the potential attendees.
  • Each year potential attendees vote on bids by one or more local groups for the privilege of hosting the conference.
  • In principle, most any aspect of the structure of the conference is open to discussion by the broader Code4Lib community — at any time.
  • Historically, any surplus from a conference has been given to the following year’s host.
  • Any group of people wanting to go to the effort can convene a local or regional Code4Lib meetup — and need not ask permission of anybody to do so.

Some practices are not unique to Code4Lib, but are highly valued:

  • The process for proposing a presentation or a preconference is intentionally light-weight.
  • The conference is single-track; for the most part, participants are expected to spend most of each day in the same room.
  • Preconferences are inexpensive.

Of course, some aspects of Code4Lib aren’t unique. The topic area certainly isn’t; library technology is not suffering any particular lack of conferences. While I believe that Code4Lib was one of the first libtech conferences to carve out time for lightning talks, many conferences do that nowadays. Code4Lib’s dependence on volunteer labor certainly isn’t unique, although putting aside keynote speakers) Code4Lib may be unique in having zero paid staff.

Code4Lib’s practice of requiring local hosts to bootstrap their fiscal operations from ground zero might be unique, as is the fact that its planning window does not extend much past 18 months. Of course, those are both arguably misfeatures that having fiscal continuity could alleviate.

Overall, the result has been a success by many measures. Code4Lib can reliably attract at least 400 or 500 attendees. Given the notorious registration rush each fall, it could very likely be larger. With its growth, however, come substantially higher expectations placed on the local hosts, and rather larger budgets — which circles us right back to the question of fiscal continuity.

I’ll close with a question: what have I missed? What makes Code4Lib qua annual conference special?

Update 2016-06-29: While at ALA Annual, I spoke with someone who mentioned another distinctive aspect of the conference: the local host is afforded broad latitude to run things as they see fit; while there is a set of lore about running the event and several people who have been involved in multiple conferences, there is no central group that dictates arrangements.  For example, while a couple recent conferences have employed a professional conference organizer, there’s nothing stopping a motivated group from doing all of the work on their own.

Code4Lib and the “open source way”

The question of what Code4Lib wants to be when it grows up seems to be perennial, and the latest iteration of the discussion is upon us. Quoting Christina Salazar:

… I really do think it’s time to reopen the question of formalizing Code4Lib IF ONLY FOR THE PURPOSES OF BEING THE FIDUCIARY AGENT for the annual conference.

I agree — we need to discuss this. The annual main conference has grown from a hundred or so in 2006 to 440 in 2016. Given the notorious rush of folks racing to register to attend each fall, it is not unreasonable to think that a conference in the right location that offered 750 seats — or even 1,000 — would still sell out. There are also over a dozen regional Code4Lib groups that have held events over the years.

With more attendees comes greater responsibilities — and greater financial commitments. Furthermore, over the years the bar has (appropriately) been raised on what is counted as the minimum responsibilities of the conference organizers. It is no longer enough to arrange to keep the bandwidth high, the latency low, and the beer flowing. A conference host that does not consider accessibility and representation is not living up to what Code4Lib qua group of thoughtful GLAM tech people should be; a host that does not take attendee safety and the code of conduct seriously is being dangerously irresponsible.

Running a conference or meetup that’s larger than what can fit in your employer’s conference room takes money — and the costs scale faster than linearly.  For recent Code4Lib conferences, the budgets have been in the low- to-middle- six figures.

That’s a lot of a money — and a lot of antacids consumed until the hotel and/or convention center minimums are met. The Code4Lib community has been incredibly lucky that a number of people have voluntarily chosen to take this stress on — and that a number of institutions have chosen to act as fiscal hosts and incur the risk of large payouts if a conference were to collapse.

To disclose: I am a member of the committee that worked on the erstwhile bid to host the 2017 conference in Chattanooga. I think we made the right decision to suspend our work; circumstances are such that many attendees would be faced with the prospect of traveling to a state whose legislature is actively trying to make it more dangerous to be there.

However, the question of building or finding a long-term fiscal host for the annual Code4Lib conference must be considered separately from the fate of the 2017 Chattanooga bid. Indeed, it should have been discussed before conference hosts found themselves transferring five-figure sums to the next year’s host.

Of course, one option is to scale back and cease attempting to organize a big international conference unless some big-enough institution happens to have the itch to backstop one. There is a lot of life in the regional meetings, and, of course, many, many people who will never get funding to attend a national conference but who could attend a regional one.

But I find stepping back like that unsatisfying. Collectively, the Code4Lib community has built an annual tradition of excellent conferences. Furthermore, those conference have gotten better (and bigger) over the years without losing one of the essences of Code4Lib: that any person who cares to share something neat about GLAM technology can have the respectful attention of their peers. In fact, the Code4Lib community has gotten better — by doing a lot of hard work — about truly meaning “any person.”

Is Code4Lib a “do-ocracy”? Loaded question, that. But this go around, there seems to be a number of people who are interested in doing something to keep the conference going in the long run. I feel we should not let vague concerns about “too much formality” or (gasp! horrors!) “too much library organization” stop the folks who are interested from making a serious go of it.

We may find out that forming a new non-profit is too much uncompensated effort. We may find out that we can’t find a suitable umbrella organization to join. Or we may find out that we can keep the conference going on a sounder fiscal basis by doing the leg-work — and thereby free up some people’s time to hack on cool stuff without having to pop a bunch of Maalox every winter.

But there’s one in argument against “formalizing” in particular that I object to. Quoting Eric Lease Morgan:

In the spirit of open source software and open access publishing, I suggest we
earnestly try to practice DIY — do it yourself — before other types of
formalization be put into place.

In the spirit of open source? OK, clearly that means that we should immediately form a non-profit foundation that can sustain nearly USD 16 million in annual expenses. Too ambitious?  Let’s settle for just about a million in annual expenses.

I’m not, of course, seriously suggesting that Code4Lib aim to form a foundation that’s remotely in the same league as the Apache Software Foundation or the Mozilla Foundation. Nor do I think Code4Lib needs to become another LITA — we’ve already got one of those (though I am proud, and privileged, to count myself a member of both).  For that matter, I do think it is possible for a project or group effort to prematurely spend too much time adopting the trappings of formal organizational structure and thus forget to actually do something.

But the sort of “DIY” (and have fun unpacking that!) mode that Morgan is suggesting is not the only viable method of “open source” organization. Sometimes open source projects get bigger. When that happens, the organizational structure always changes; it’s better if that change is done openly.

The Code4Lib community doesn’t have to grow larger; it doesn’t have to keep running a big annual conference. But if we do choose to do that — let’s do it right.

Securing Z39.50 traffic from Koha and Evergreen Z39.50 servers using YAZ and TLS

There’s often more than way to search a library catalog; or to put it another way, not all users come in via the front door.  For example, ensuring that your public catalog supports HTTPS can help prevent bad actors from snooping on patron’s searches — but if one of your users happens to use a tool that searches your catalog over Z39.50, by default they have less protection.

Consider this extract from a tcpdump of a Z39.50 session:

No, MARC is not a cipher; it just isn’t.

How to improve this state of affairs? There was some discussion back in 2000 of bundling SSL or TLS into the Z39.50 protocol, although it doesn’t seem like it went anywhere. Of course, SSH tunnels and stunnel are options, but it turns out that there can be an easier way.

As is usually the case with anything involving Z39.50, we can thank the folks at IndexData for being on top of things: it turns out that TLS support is easily enabled in YAZ. Here’s how this can be applied to Evergreen and Koha.

The first step is to create an SSL certificate; a self-signed one probably suffices. The certificate and its private key should be concatenated into a single PEM file, like this:

-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----

Evergreen’s Z39.50 server can be told to require SSL via a <listen> element in /openils/conf/oils_yaz.xml, like this:

To supply the path to the certificate, a change to oils_ctl.sh will do the trick:

For Koha, a <listen> element should be added to koha-conf.xml, e.g.,

zebrasrv will also need to know how to find the SSL certificate:

And with that, we can test: yaz-client ssl:localhost:4210/CONS or yaz-client ssl:localhost:4210/biblios. Et voila!

Of course, not every Z39.50 client will know how to use TLS… but lots will, as YAZ is the basis for many of them.

Three tales regarding a decrease in the number of catalogers

Discussions on Twitter today – see the timelines of @cm_harlow and @erinaleach for entry points – got me thinking.

In 1991, the Library of Congress had 745 staff in its Cataloging Directorate. By the end of FY 2004, the LC Bibliographic Access Divisions had between 5061 and 5612 staff.

What about now? As of 2014, the Acquisitions and Bibliographic Access unit has 238 staff3.

While I’m sure one could quibble about the details (counting FTE vs. counting humans, accounting for the reorganizations, and so forth), the trend is clear: there has been a precipitous drop in the number of cataloging staff employed by the Library of Congress.

I’ll blithely ignore factors such as shifts in the political climate in the U.S. and how they affect civil service. Instead, I’ll focus on library technology, and spin three tales.

The tale of the library technologists

The decrease in the number of cataloging staff are one consequence of a triumph of library automation. The tools that we library technologists have written allow catalogers to work more efficiently. Sure, there are fewer of them, but that’s mostly been due to retirements. Not only that, the ones who are left are now free to work on more intellectually interesting tasks.

If we, the library technologists, can but slip the bonds of legacy cruft like the MARC record, we can make further gains in the expressiveness of our tools and the efficiencies they can achieve. We will be able to take advantage of metadata produced by other institutions and people for their own ends, enabling library metadata specialists to concern themselves with larger-scale issues.

Moreover, once our data is out there – who knows what others, including our patrons, can achieve with it?

This will of course be pretty disruptive, but as traditional library catalogers retire, we’ll reach buy-in. The library administrators have been pushing us to make more efficient systems, though we wish that they would invest more money in the systems departments.

We find that the catalogers are quite nice to work with one-on-one, but we don’t understand why they seem so attached to an ancient format that was only meant for record interchange.

The tale of the catalogers

The decrease in the number of cataloging staff reflects a success of library administration in their efforts to save money – but why is it always at our expense? We firmly believe that our work with the library catalog/metadata services counts as a public service, and we wish more of our public services colleagues knew how to use the catalog better.  We know for a fact that what doesn’t get catalogued may as well not exist in the library.

We also know that what gets catalogued badly or inconsistently can cause real problems for patrons trying to use the library’s collection.  We’ve seen what vendor cataloging can be like – and while sometimes it’s very good, often it’s terrible.

We are not just a cost center. We desperately want better tools, but we also don’t think that it’s possible to completely remove humans from the process of building and improving our metadata. 

We find that the library technologists are quite nice to work with one-on-one – but it is quite rare that we get to actually speak with a programmer.  We wish that the ILS vendors would listen to us more.

The tale of the library directors

The decrease in the number of cataloging staff at the Library of Congress is only partially relevant to the libraries we run, but hopefully somebody has figured out how to do cataloging more cheaply. We’re trying to make do with the money we’re allocated. Sometimes we’re fortunate enough to get a library funding initiative passed, but more often we’re trying to make do with less: sometimes to the point where flu season makes us super-nervous about our ability to keep all of the branches open.

We’re concerned not only with how much of our budgets are going into electronic resources, but with how nigh-impossible it is to predict increases in fees for ejournal subscriptions/ fees for ebook services.

We find that the catalogers and the library technologists are pleasant enough to talk to, but we’re not sure how well they see the big picture – and we dearly wish they could clearly articulate how yet another cataloging standard / yet another systems migration will make our budgets any more manageable.

Each of these tales is true. Each of these tales is a lie. Many other tales could be told. Fuzziness abounds.

However, there is one thing that seems clear: conversations about the future of library data and library systems involve people with radically different points of view. These differences do not mean that any of the people engaged in the conversations are villains, or do not care about library users, or are unwilling to learn new things.

The differences do mean that it can be all too easy for conversations to fall apart or get derailed.

We need to practice listening.

1. From testmony by the president of the Library of Congress Professional Guild to Congress on 6 March 2015.
2. From the BA FY 2004 report. This including 32 staff from the Cataloging Distribution Service, which had been merged into BA and had not been part of the Cataloging Directorate.
3. From testmony by the president of the Library of Congress Professional Guild to Congress on 6 March 2015.

Henriette Avram versus the world: Is COBOL capable of processing MARC?

Is the COBOL programming language capable of processing MARC records?

A computer programmer in 2015 could be excused for thinking to herself, what kind of question is that!?! Surely it’s obvious that any programming language capable of receiving input can parse a simple, antique record format?

In 1968, it apparently wasn’t so obvious. I turned up an article by Henriette Avram and a colleague, MARC II and COBOL, that was evidently written in response to a review article by a Hillis Griffin where he stated

Users will require programmers skilled in languages other than FORTRAN or COBOL to take advantage of MARC records.

Avram responded to Griffin’s concern in the most direct way possible: by describing COBOL programs developed by the Library of Congress to process MARC records and generate printed catalogs. Her article even include source code, in case there were any remaining doubts!

I haven’t yet turned up any evidence that Henriette Avram and Grace Hopper ever met, but it was nice to find a close, albeit indirect connection between the two of them via COBOL.

Is the debate between Avram and Griffen in 1968 regarding COBOL and MARC anything more than a curiosity? I think it is — many of the discussions she participated in are reminiscent of debates that are taking place now. To fair to Griffin, I don’t know enough about the computing environment of the late sixties to be able to definitely say that his statement was patently ill-informed at the time — but given that by 1962 IBM had announced that they were standardizing on COBOL, it seems hardly surprising that Avram and her group would be writing MARC processing code in COBOL on an IBM/360 by 1968. To me, the concerns that Griffin raised seem on par with objections to Library Linked Data that assume that each library catalog request would necessarily mean firing off a dozen requests to RDF providers — objections that have rejoinders that are obvious to programmers, but perhaps not so obvious to others.

Plus ça change, plus c’est la même chose?

Notes on making my WordPress blog HTTPS-only

The other day I made this blog, galencharlton.com/blog/, HTTPS-only.  In other words, if Eve want to sniff what Bob is reading on my blog, she’ll need to do more than just capture packets between my blog and Bob’s computer to do so.

This is not bulletproof: perhaps Eve is in possession of truly spectacular computing capabilities or a breakthrough in cryptography and can break the ciphers. Perhaps she works for any of the sites that host external images, fonts, or analytics for my blog and has access to their server logs containing referrer headers information.  Currently these sites are Flickr (images), Gravatar (more images), Google (fonts) or WordPress (site stats – I will be changing this soon, however). Or perhaps she’s installed a keylogger on Bob’s computer, in which case anything I do to protect Bob is moot.

Or perhaps I am Eve and I’ve set up a dastardly plan to entrap people by recording when they read about MARC records, then showing up at Linked Data conferences and disclosing that activity.  Or vice versa. (Note: I will not actually do this.)

So, yes – protecting the privacy of one’s website visitors is hard; often the best we can do is be better at it than we were yesterday.

To that end, here are some notes on how I made my blog require HTTPS.

Certificates

I got my SSL certificate from Gandi.net. Why them?  Their price was OK, I already register my domains through them, and I like their corporate philosophy: they support a number of free and open source software projects; they’re not annoying about up-selling, and they have never (to my knowledge) run sexist advertising, unlikely some of their larger and more well-known competitors. But there are, of course, plenty of options for getting SSL certificates, and once Let’s Encrypt is in production, it should be both cheaper and easier for me to replace the certs next year.

I have three subdomains of galencharlton.com that I wanted a certificate for, so I decided to get a multi-domain certificate.  I consulted this tutorial by rtCamp to generate the CSR.

After following the tutorial to create a modified version of openssl.conf specifying the subjectAltName values I needed, I generated a new private key and a certificate-signing request as follows:

The openssl command asked me a few questions; the most important of which being the value to set the common name (CN) field; I used “galencharlton.com” for that, as that’s the primary domain that the certificate protects.

I then entered the text of the CSR into a form and paid the cost of the certificate.  Since I am a library techie, not a bank, I purchased a domain-validated certificate.  That means that all I had to prove to the certificate’s issuer that I had control of the three domains that the cert should cover.  That validation could have been done via email to an address at galencharlton.com or by inserting a special TXT field to the DNS zone file for galencharlton.com. I ended up choosing to go the route of placing a file on the web server whose contents and location were specified by the issuer; once they (or rather, their software) downloaded the test files, they had some assurance that I had control of the domain.

In due course, I got the certificate.  I put it and the intermediate cert specified by Gandi in the /etc/ssl/certs directory on my server and the private key in /etc/private/.

Operating System and Apache configuration

Various vulnerabilities in the OpenSSL library or in HTTPS itself have been identified and mitigated over the years: suffice it to say that it is a BEASTly CRIME to make a POODLE suffer a HeartBleed — or something like that.

To avoid the known problems, I wanted to ensure that I had a recent enough version of OpenSSL on the web server and had configured Apache to disable insecure protocols (e.g., SSLv3) and eschew bad ciphers.

The server in question is running Debian Squeeze LTS, but since OpenSSL 1.0.x is not currently packaged for that release, I ended up adding Wheezy to the APT repositories list and upgrading the openssl and apache2 packages.

For the latter, after some Googling I ended up adapting the recommended Apache SSL virtualhost configuration from this blog post by Tim Janik.  Here’s what I ended up with:

I also wanted to make sure that folks coming in via old HTTP links would get permanently redirected to the HTTPS site:

Checking my work

I’m a big fan of the Qualsys SSL Labs server test tool, which does a number of things to test how well a given website implements HTTPS:

  • Identifying issues with the certificate chain
  • Whether it supports vulnerable protocol versions such as SSLv3
  • Whether it supports – and request – use of sufficiently strong ciphers.
  • Whether it is vulnerable to common attacks.

Suffice it to say that I required a couple iterations to get the Apache configuration just right.

WordPress

To be fully protected, all of the content embedded on a web page served via HTTPS must also be served via HTTPS.  In other words, this means that image URLs should require HTTPS – and the redirects in the Apache config are not enough.  Here is the sledgehammer I used to update image links in the blog posts:

Whee!

I also needed to tweak a couple plugins to use HTTPS rather than HTTP to embed their icons or fetch JavaScript.

Finishing touches

In the course of testing, I discovered a couple more things to tweak:

  • The web sever had been using Apache’s mod_php5filter – I no longer remember why – and that was causing some issues when attempting to load the WordPress dashboard.  Switching to mod_php5 resolved that.
  • My domain ownership proof on keybase.io failed after the switch to HTTPS.  I eventually tracked that down to the fact that keybase.io doesn’t have a bunch of intermediate certificates in its certificate store that many browsers do. I resolved this by adding a cross-signed intermediate certificate to the file referenced by SSLCertificateChainFile in the Apache config above.

My blog now has an A+ score from SSL Labs. Yay!  Of course, it’s important to remember that this is not a static state of affairs – another big OpenSSL or HTTPS protocol vulnerability could turn that grade to an F.  In other words, it’s a good idea to test one’s website periodically.