Category Archives: Libraries

Data cleanup as a force for evil

A quotidian concern of anybody responsible for a database is the messy data it contains. See a record about a Pedro GonzÃ¡lez? Bah, the assumption of Latin-1 strikes again! Better correct it to González. Looking at his record in the first place because you’re reading his obituary? Oh dear, better mark him as deceased. 12,741 people living in the bungalow at 123 Main St.? Let us now ponder the wisdom of the null and the foolishness of the dummy value.
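The Latin-1 mix-up is easy to reproduce and, happily, often easy to reverse. A minimal sketch (the variable names are mine, and this only works when the garbled bytes survived intact):

```python
# UTF-8 bytes misread as Latin-1 turn "González" into "GonzÃ¡lez";
# the cleanup is simply the round trip run in reverse.
correct = "González"

# The corruption: encode as UTF-8, then (wrongly) decode as Latin-1.
garbled = correct.encode("utf-8").decode("latin-1")
print(garbled)  # GonzÃ¡lez

# The repair: re-encode as Latin-1 to recover the original bytes,
# then decode them as the UTF-8 they always were.
repaired = garbled.encode("latin-1").decode("utf-8")
assert repaired == correct
```

Of course, real databases rarely make it this tidy; a record that has been double-mangled or partially corrected resists so mechanical a fix.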

Library name authority control could be viewed as a grand collaborative data cleanup project without having to squint too hard.

What of the morality of data cleanup? Let’s assume that the data should be gathered in the first place; then as Patricia Hayes noted back in 2004, there is of course an ethical expectation that efforts such as medical research will be based on clean data: data that has been carefully collected under systematic supervision.

Let’s consider another context: whether to engage in batch authority cleanup of a library catalog. The decision of whether it is worth the cost, like most decisions on allocating resources, has an ethical dimension: does the improvement in the usefulness of the catalog outweigh the benefits of other potential uses of the money? Sometimes yes, sometimes no, and the decision often depends on local factors, but generally there’s not much examination of the ethics of the data cleanup per se. After all, if you should have the database in the first place, it should be as accurate and precise as you can manage consistent with its raison d’être.

Now let’s consider a particular sort of database. One full of records about people. Specifically, a voter registration database.  There are many like it; after all, at its heart it’s just a slightly overgrown list of names and addresses.

An overgrown list of names and addresses around which much mischief has been done in the name of accuracy.

This is on my mind because the state I live in, Georgia, is conducting a gubernatorial election that just about doubles as a referendum on how to properly maintain a voter registration list.

On the one hand, you have Brian Kemp, the current Georgia secretary of state, whose portfolio includes the office that maintains the statewide voter database and oversees all elections. On the other hand, Stacey Abrams, who among other things founded the New Georgia Project aimed at registering tens of thousands of new voters, albeit with mixed results.

Is it odd for a candidate to oversee the department that would certify the winner of his own race? The NAACP and others think so, having filed a lawsuit to try to force Kemp to step down as secretary of state. Moreover, Kemp has a history of efforts to “clean” the voter rolls; efforts that tend to depress votes by minorities—in a state that is becoming increasingly purple.  (And consider the county I live in, Gwinnett County. It is the most demographically diverse county in the southeast… and happens to have the highest rate of rejection of absentee ballots so far this year.) Most recently, the journalist Greg Palast published a database of voters purged from Georgia’s list. This database contains 591,000 names removed from the rolls in 2017… one tenth of the list!

A heck of a data cleanup project, eh?

Every record removal that prevents a voter from casting their ballot on election day is an injustice. Every one of the 53,000 voters whose registration is left pending due to the exact match law is suffering an injustice. Hopefully they won’t be put off and will vote… if they can produce ID… if the local registrar’s discretion leans towards expanding and not collapsing the franchise.

Dare I say it? Data cleanup is not an inherently neutral endeavor.

Sure, much of the time data cleanup work is just improving the accuracy of a database—but not always. If you work with data about people, be wary.

On being wrong, wrong, wrong

Yesterday I gave a lightning talk at the Evergreen conference on being wrong. Appropriately, I started out the talk on the wrong foot. I intended to give the talk today, but when I signed up for a slot, I failed to notice that the signup sheet I used was for yesterday. It was a good thing that I had decided to listen to the other lightning talks yesterday, as that way the facilitator was able to find me to tell me that I was up next.

Oops.

When she did that, I initially asked to do it today as I had intended… but changed my mind and decided to charge ahead. Lightning talks are all about serendipity, right?

The talk went something like this: after mentioning my scheduling mix-up, I spoke about how I have been active in the Evergreen project for almost nine years. I’ve worn a variety of project hats over that time, including those of developer, core committer, release manager, member of the Evergreen Oversight Board, chair of the EOB, and so forth. While I am of course proud of the contributions I’ve made, my history with the project also includes being wrong about many things and failing a lot.

I’ve been wrong about coding issues. I’ve been responsible for regressions. I’ve had my share of brown-bag releases. I’ve misunderstood what library staff and patrons were trying to accomplish. I’ve made assumptions about the working conditions and circumstances of users that were very wrong indeed. Some of my bug reports and test plans have not been particularly clear.

Why bring up my wrongness? Prior to the talk, I had been part of a couple conversations about how some folks feel intimidated about writing bug reports or posting to the mailing lists for fear of being judged if their submission was not perfect. Of course, I don’t want people to feel intimidated; the project needs bug reports and contributions from anybody who cares enough about the software to make the effort. By mentioning how I — as somebody who is unquestionably a senior contributor to the project — have been repeatedly wrong, I hoped to humanize people like me: we’re not perfect. Perfection is not a requirement for gaining status in the community as a respected contributor — and that’s a good thing.

I also wanted to give permission for folks to be wrong, in the hopes that doing so might help lower a barrier to participating.

So much for the gist of the lightning talk. People in the audience seemed to enjoy it, and I got a couple nice comments about it, including somebody mentioning how they wished they had heard something like that as they were making their first contributions to the project.

However, I would also like to expand a bit on a couple points.

Permission to be wrong is not something I can grant all by myself. While I can try to model good ways of providing feedback (and get better myself at it; I’ve certainly been wrong many a time about how to do so), it sometimes doesn’t take much for an interaction with a new contributor (or an experienced one!) to become unwelcoming to the point where we lose the contributor forever. This is not a theoretical concern; while I think we have gotten much better over the years, there were certainly times and circumstances where it was very rational to feel intimidated about participating in the project in certain ways for fear of getting dismissive feedback.

Giving ourselves permission to be wrong is a community responsibility; by doing so we can give ourselves permission to improve. However, this can’t be treated as a platitude: it takes effort and thoughtfulness both to ensure that the community is welcoming at all levels, and to ensure that permission to be wrong isn’t accorded only to people who look like me.

One of the things that the conference keynote speaker Crystal Martin asked the community to consider was this: “Lift as you climb.” I challenge senior contributors to the Evergreen project — including myself — to take this to heart. I have benefited a lot by being able to be wrong; we should act to ensure that everybody else in the community can be allowed to be wrong as well.

Fostering a habit of nondisclosure

It almost doesn’t need to be said that old-fashioned library checkout cards were terrible for patron privacy. Want to know who had checked out a book? Just take the card out of its pocket and read.

It’s also a trivial observation that there’s a mini-genre of news articles and social media posts telling the tales of prodigal books, returning to their library after years or decades away, usually having gathered nothing but dust.

Put these two together on a slow news day? Without care, you can end up not protecting a library user’s right to privacy and confidentiality with respect to resources borrowed, to borrow some words from the ALA Code of Ethics.

Faced with this, one’s sense of proportion may ask, “so what?” The borrower of a book returned sixty years late is quite likely dead, and if alive, not likely to suffer any social opprobrium or even sixty years of accumulated overdue fines.  Even if the book in question was a copy of The Anarchist Cookbook, due back on Tuesday, 11 May 1976, the FBI no doubt has lost interest in the matter.

Of course, an immediate objection to that attitude is that personal harm to the patron remains possible, even if not probable. Sometimes the borrower wants to keep a secret to the grave. They may simply not care to be the subject of a local news story.

The potential for personal harm to the borrower is of course clearer if we consider more recent loans. It’s not the job of a librarian to out somebody who wishes to remain in the closet; it remains the case that somebody who does not care to have another snoop on their reading should be entitled to read, and think, in peace.

At this point, the sense of proportion that has somehow embodied itself in this post may rejoin, “you’re catastrophizing here, Charlton,” and not be entirely wrong. Inadvertent disclosure of patron information at the “retail” level does risk causing harm, but is not guaranteed to. After all, lots of people have no problem sharing (some of) their reading history. Otherwise, LibraryThing and Goodreads would just sit there gathering tumbleweeds.

I’d still bid that sense of proportion to shuffle off with this: it’s mostly not the librarians bearing the risk of harm.

However, there’s a larger point: libraries nowadays run much higher risks of violating patron privacy at the “wholesale” level than they used to.

Remember those old checkout cards? Back in the day, an outsider trying to get a borrower’s complete reading history might have to turn out every book in the library to do so. Today, it can be much easier: find a way in, and you can have everything (including driver’s license numbers, addresses, and, if the patrons are really ill-served by their library, SSNs).

That brings me to my point: we should care about nondisclosure (and better yet, non-collection of data we don’t need) at the retail level to help bolster a habit of caring about it at the wholesale level.

Imagine a library where people at every level can feel free to point out and correct patron privacy violations — and know that they should. Where the social media manager — whose degree may not be an MLS — redacts patron names and/or asks for permission every time.  Where, and more to my point, the director and the head of IT make technology choices that protect patron privacy — because they are in the habit of thinking about patron privacy in the first place.

This is why it’s worth it to sweat the small disclosures, to be better prepared against large ones.

Mashcat at ALA Annual 2017 + shared notes

I’m leaving for Chicago tomorrow to attend ALA Annual 2017 (and to eat some real pizza), and while going over the schedule I found some programs that may be of interest to Mashcat folk:

As a little experiment, I’ve started a Google Doc for shared notes about events and other goings-on at the conference. There will of course be a lot of coverage on social media about the conference, but the shared notes doc might be a way for Mashcatters to identify common themes.

What makes an anti-librarian?

Assuming the order gets made and shipped in time (update 2017-06-22: it did), I’ll be arriving in Chicago for ALA Annual carrying a few tens of badge ribbons like this one:

Am I hoping that the librarians made of anti-matter will wear these ribbons to identify themselves, thereby avoiding unpleasant explosions and gamma ray bursts? Not really. Besides, there’s an obvious problem with this strategy, were anti-matter librarians a real constituency at conferences.

No, in a roundabout way, I’m mocking this behavior by Jeffrey Beall:

“This is fake news from an anti-librarian. Budget cuts affect library journal licensing much more than price hikes. #OA #FakeNews”

Seriously, dude?

I suggest reading Rachel Walden’s tweets for more background, but suffice it to say that even if you were to discount Walden’s experience as a medical library director (which I do not), Beall’s response to her is extreme. (And for even more background, John Dupuis has an excellent compilation of links on recent discussions about Open Access and “predatory” journals.)

But I’d like to unpack Beall’s choice of the expression “anti-librarian”. What exactly makes for an anti-librarian?

We already have plenty of names for folks who oppose libraries and librarians. Book-burners. Censors. Austeritarians. The closed-minded. The tax-cutters-above-all-else. The drowners of governments in bathtubs. The fearful. We could have a whole taxonomy, in fact, were the catalogers to find a few spare moments.

“Anti-librarian” as an epithet doesn’t fit most of these folks. Instead, as applied to a librarian, it has some nasty connotations: a traitor. Somebody who wears the mantle of the profession but opposes its very existence. Alternatively: a faker. A purveyor of fake news. One who is unfit to participate in the professional discourse.

There may be some librarians who deserve to have that title — but it would take a lot more than being mistaken, or even woefully misguided, to earn that.

So let me also protest Beall’s response to Walden explicitly:

It is not OK.

It is not cool.

It is not acceptable.

Continuing the lesson

The other day, school librarian and author Jennifer Iacopelli tweeted about her experience helping a student whose English paper had been vandalized by some boys. After she had left the Google Doc open in the library computer lab when she went home, they had inserted some “inappropriate” stuff. When she and her mom went to work on it later that evening, mom saw the insertions, was appalled, and grounded the student. Iacopelli, using security camera footage from the library’s computer lab, was able to demonstrate that the boys were responsible, with the result that the grounding was lifted and the boys suspended.

This story has gotten retweeted 1,300 times as of this writing and earned Iacopelli a mention as a “badass librarian” in HuffPo.

Before I continue, I want to acknowledge that there isn’t much to complain about regarding the outcome: justice was served, and mayhap the boys in question will think thrice before attacking the reputation of another or vandalizing their work.

Nonetheless, I do not count this as an unqualified feel-good story.

I have questions.

Was there no session management software running on the lab computers that would have closed off access to the document when she left at the end of the class period? If not, the school should consider installing some. On the other hand, I don’t want to hang too much on this pin; it’s possible that some was running but that a timeout hadn’t been reached before the boys got to the computer.

How long is security camera footage from the library computer lab retained? Based on the story, it sounds like it is kept at least 24 hours. Who, besides Iacopelli, can access it? Are there procedures in place to control access to it?

More fundamentally: is there a limit to how far student use of computers in that lab is monitored? Again, I do not fault the outcome in this case—but neither am I comfortable with Iacopelli’s embrace of surveillance.

Let’s consider some of the lessons learned. The victim learned that adults in a position of authority can go to bat for her and seek and acquire justice; maybe she will be inspired to help others in a similar position in the future. She may have learned a bit about version control.

She also learned that surveillance can protect her.

And well, yes. It can.

But I hope that the teaching continues—and not the hard way. Because there are other lessons to learn.

Surveillance can harm her. It can cause injustice, against her and others. Security camera footage sometimes doesn’t catch the truth. Logs can be falsified. Innocent actions can be misconstrued.

Her thoughts are her own.

And truly badass librarians will protect that.

How to build an evil library catalog

Consider a catalog for a small public library that features a way to sort search results by popularity. There are several ways to measure “popularity” of a book: circulations, hold requests, click-throughs in the catalog, downloads, patron-supplied ratings, place on bestseller lists, and so forth.

But let’s do a little thought experiment: let’s use a random number generator to calculate popularity.

However, the results will need to be plausible. It won’t do to have the catalog assert that the latest J.D. Robb book is gathering dust in the stacks. Conversely, the copy of the 1959 edition of The geology and paleontology of the Elk Mountain and Tabernacle Butte area, Wyoming that was given to the library right after the last weeding is never going to be a doorbuster.

So let’s be clever and ensure that the 500 most circulated titles in the collection retain their expected popularity rating. Let’s also leave books that have never circulated alone in their dark corners, as well as those that have no cover images available. The rest, we leave to the tender mercies of the RNG.
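The scheme above can be sketched in a few lines of Python. To be clear, this is the thought experiment and nothing more; the function and data names are hypothetical, and nobody should deploy it:

```python
import random

def skewed_popularity(record_id, real_scores, top_ids, dormant_ids, rng=None):
    """Return a 'popularity' score for a catalog record.

    The 500 most-circulated titles (top_ids) and the never-circulated or
    coverless titles (dormant_ids) keep their real scores, for plausibility;
    everything else is left to the tender mercies of the RNG.
    """
    rng = rng or random.Random()
    if record_id in top_ids or record_id in dormant_ids:
        return real_scores[record_id]
    return rng.random()  # a number in [0, 1), untethered from reality
```

The plausibility guard is what makes the mischief hard to spot: the records a patron is most likely to sanity-check are exactly the ones left untouched.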

What will happen? If patrons use the catalog’s popularity rankings, if they trust them — or at least are more likely to look at whatever shows up near the top of search results — we might expect that the titles with an artificial bump from the random number generator will circulate just a bit more often.

Of course, testing that hypothesis by letting a RNG skew search results in a real library catalog would be unethical.

But if one were clever enough to be subtle in one’s use of the RNG, the patrons would have a hard time figuring out that something was amiss.  From the user’s point of view, a sufficiently advanced search engine is indistinguishable from a black box.

This suggests some interesting possibilities for the Evil Librarian of Evil:

  • Some manual tweaks: after all, everybody really ought to read $BESTBOOK. (We won’t mention that it was written by the ELE’s nephew.)
  • Automatic personalization of search results. Does geolocation show that the patron’s IP address is on the wrong side of the tracks? Titles with a lower reading level just got more popular!
  • Has the patron logged in to the catalog? Personalization just got better! Let’s check the patron’s gender and tune accordingly!

Don’t be the ELE.

But as you work to improve library catalogs… take care not to become the ELE by accident.

What makes the annual Code4Lib conference special?

There’s now a group of people taking a look at whether and how to set up some sort of ongoing fiscal entity for the annual Code4Lib conference.  Of course, one question that comes to mind is why go to the effort? What makes the annual Code4Lib conference so special?

There are lot of narratives out there about how the Code4Lib conference and the general Code4Lib community has helped people, but for this post I want to focus on the conference itself. What does the conference do that is unique or uncommon? Is there anything that it does that would be hard to replicate under another banner? Or to put it another way, what makes Code4Lib a good bet for a potential fiscal host — or something worth going to the effort of forming a new non-profit organization?

A few things that stand out to me as distinctive practices:

  • The majority of presentations are directly voted upon by the people who plan to attend (or who are at least invested enough in Code4Lib as a concept to go to the trouble of voting).
  • Similarly, keynote speakers are nominated and voted upon by the potential attendees.
  • Each year potential attendees vote on bids by one or more local groups for the privilege of hosting the conference.
  • In principle, most any aspect of the structure of the conference is open to discussion by the broader Code4Lib community — at any time.
  • Historically, any surplus from a conference has been given to the following year’s host.
  • Any group of people wanting to go to the effort can convene a local or regional Code4Lib meetup — and need not ask permission of anybody to do so.

Some practices are not unique to Code4Lib, but are highly valued:

  • The process for proposing a presentation or a preconference is intentionally light-weight.
  • The conference is single-track; for the most part, participants are expected to spend most of each day in the same room.
  • Preconferences are inexpensive.

Of course, some aspects of Code4Lib aren’t unique. The topic area certainly isn’t; library technology is not suffering any particular lack of conferences. While I believe that Code4Lib was one of the first libtech conferences to carve out time for lightning talks, many conferences do that nowadays. Code4Lib’s dependence on volunteer labor certainly isn’t unique, although (putting aside keynote speakers) Code4Lib may be unique in having zero paid staff.

Code4Lib’s practice of requiring local hosts to bootstrap their fiscal operations from ground zero might be unique, as is the fact that its planning window does not extend much past 18 months. Of course, those are both arguably misfeatures that having fiscal continuity could alleviate.

Overall, the result has been a success by many measures. Code4Lib can reliably attract at least 400 or 500 attendees. Given the notorious registration rush each fall, it could very likely be larger. With its growth, however, come substantially higher expectations placed on the local hosts, and rather larger budgets — which circles us right back to the question of fiscal continuity.

I’ll close with a question: what have I missed? What makes Code4Lib qua annual conference special?

Update 2016-06-29: While at ALA Annual, I spoke with someone who mentioned another distinctive aspect of the conference: the local host is afforded broad latitude to run things as they see fit; while there is a set of lore about running the event and several people who have been involved in multiple conferences, there is no central group that dictates arrangements.  For example, while a couple recent conferences have employed a professional conference organizer, there’s nothing stopping a motivated group from doing all of the work on their own.

Cataloging and coding as applied empathy: a Mashcat discussion prompt

Consider the phrase “Cataloging and coding as applied empathy”.  Here are some implications of those six words:

  • Catalogers and coders share something: what we build is mainly for use by other people, not ourselves. (Yes, we programmers often try to eat our own dogfood, and catalogers tend to be library users, but that’s mostly not what we’re paid for.)
  • Consideration of the needs of our users is needed to do our jobs well, and to do right by our users.
  • However: we cannot rely on our users to always tell us what to do:
    • sometimes they don’t know what it is possible to want;
    • sometimes they can’t articulate what they want in a way that lends itself to direct translation to code or taxonomy;
    • it is rarely their paid job to tell us what they want, and how to build it.
  • Waiting for users to tell us exactly what to do can be a decision… to do nothing. Sometimes doing nothing is the best thing to do; often it’s not.
  • Therefore, catalogers and coders need to develop empathy.
  • Applied empathy: our catalogs and our software in some sense embody our empathy (or lack thereof).
  • Applied empathy: empathy can be a learned skill.

Is “applied empathy” a useful framework for discussing how to serve our users? I don’t know, so I’d like to chat about it.  I will be moderating a Mashcat Twitter chat on Thursday, 12 May 2016, at 20:30 UTC (time converter). Do you have questions to suggest? Please add them to the Google doc for this week’s chat.

Natural and unnatural problems in the domain of library software

I offer up two tendentious lists. First, some problems in the domain of library software that are natural to work on, and in the hopeful future, solve:

  • Helping people find stuff. On the one hand, this surely comes off as simplistic; on the other hand, it is the core problem we face, and has been the core problem of library technology from the very moment that a library’s catalog grew too large to stay in the head of one librarian.  There are of course a number of interesting sub-problems under this heading:
    • Helping people produce and maintain useful metadata.
    • Usefully aggregating metadata.
    • Helping robots find stuff (presumably with the ultimate purpose of helping people to find stuff).
    • Artificial intelligence. By this I’m not suggesting that library coders should be aiming to have an ILS kick off the Singularity, but there’s plenty of room for (e.g.) natural language processing to assist in the overall task of helping people find stuff.
  • Helping people evaluate stuff. “Too much information, little knowledge, less wisdom” is one way of describing the glut of bits infesting the Information Age. Libraries can help and should help—even though pitfalls abound.
  • Helping people navigate software and information resources. This includes UX for library software, but also a lot of other software that librarians, like it or not, find themselves helping patrons use. There are some areas of software engineering where the programmer can assume that the user is expert in the task that the software assists with; library software isn’t one of them.
  • Sharing stuff. What is Evergreen if not a decade-long project in figuring out ways to better share library materials among more users? Sharing stuff is not a solved problem even for digital stuff.
  • Keeping stuff around. This is an increasingly difficult problem. Time was, you could leave a pile of books sitting around and reasonably expect that at least a few would still exist five hundred years hence. Digital stuff never rewards that sort of carelessness.
  • Protecting patron privacy. This nearly ended up in the unnatural list—a problem can be unnatural but nonetheless crucial to work on. However, since there’s no reason to expect that people will stop being nosy about what other people are reading—and for that nosiness to sometimes turn into persecution—here we are.
  • Authentication. If the library keeps any transaction information on behalf of a patron so that they can get to it later, the software had better be trying to make sure that only the correct patron can see it. Of course, one could argue that library software should never store such information in the first place (after, say, a loan is returned), but I think there can be an honest conflict with patrons’ desires to keep track of what they used in the past.

Second, some distinctly unnatural problems that library technologists all too often must work on:

  • Digital rights management. If Ambrose Bierce were alive, I would like to think that he might define DRM in a library context thus: “Something that is ineffective in its stated purpose—and cannot possibly be effective—but which serves to compromise libraries’ commitment to patron privacy in the pursuit of a misunderstanding about what will keep libraries relevant.”
  • Walled garden maintenance. Consider EZproxy. It takes the back of a very small envelope to realize that hundreds of thousands of person-hours have been expended fiddling with EZproxy configuration files for the sake of bolstering the balance sheets of Big Journal. Is this characterization unfair? Perhaps. Then consider this alternative formulation: the opportunity cost imposed by time spent maintaining or working around barriers to the free exchange of academic publications is huge—and unlike DRM for public library ebooks, there isn’t even a case (good, bad, or indifferent) to be made that the effort results in any concrete financial compensation to the academics who wrote the journal articles that are being so carefully protected.
  • Authorization. It’s one thing to authenticate a patron so that they can get at whatever information the library is storing on their behalf. It’s another thing to spend time coding authentication and authorization systems as part of maintaining the walled gardens.

The common element among the problems I’m calling unnatural? Copyright; in the particular, the current copyright regime that enforces the erection of barriers to sharing—and which we can imagine, if perhaps wistfully, changing to the point where DRM and walled garden maintenance need not occupy the attention of the library programmer, who then might find more time to work on some of the natural problems.

Why is this on my mind? I would like to give a shout-out to (and blow a raspberry at) an anonymous publisher who had this to say in a recent article about Sci-Hub:

And for all the researchers at Western universities who use Sci-Hub instead, the anonymous publisher lays the blame on librarians for not making their online systems easier to use and educating their researchers. “I don’t think the issue is access—it’s the perception that access is difficult,” he says.

I know lots of library technologists who would love to have more time to make library software easier to use. Want to help, Dear Anonymous Publisher? Tell your bosses to stop building walls.