Yesterday the Open Knowledge Foundation announced their principles of open bibliographic data. Following a definition of “bibliographic data” (though I don’t think that the distinction drawn between “core” and “secondary” data is useful here), the principles are:
- When publishing bibliographic data make an explicit and robust license statement.
- Use a recognized waiver or license that is appropriate for data.
- If you want your data to be effectively used and added to by others it should be open as defined by the Open Definition (http://opendefinition.org) — in particular non-commercial and other restrictive clauses should not be used.
- Where possible, we recommend explicitly placing bibliographic data in the Public Domain via PDDL or CC0.
I have endorsed the principles and encourage others to do the same.
The principle discouraging data licenses that restrict commercial reuse is an important one. I can see why somebody who is considering releasing a set of bibliographic data into the wild might be tempted to use a license that forbids commercial use. After all, the vast majority of bibliographic records are created or improved by librarians working for non-profit or governmental entities. Although I don’t think anybody ever became rich beyond the dreams of avarice reselling library data, obviously there is some money to be made there, given the existence of commercial and quasi-commercial firms that deal with library metadata. Why should those firms be allowed to make money off the fruits of the labor of countless catalogers without direct financial recompense?
And … I can’t say that I entirely disagree. Libraries spend a lot of money creating metadata in a punishing economy; if some libraries can manage to get some money back to help keep catalogers and metadata specialists employed, so much the better. Until the advent of true artificial intelligence, there will always be an important role for the human creation and maintenance of metadata, though we also need to do a lot better with automating metadata production.
However, bibliographic data is most useful in the aggregate. A single bibliographic record, no matter how well crafted, has very little value. Put enough of them together to describe a library’s collection, and you start to get somewhere: you now have enough to make a catalog. Put a lot of metadata together, and you can do all kinds of interesting things.
It is in the aggregation of metadata that the licensing decisions libraries make when releasing bibliographic data matter most. The less friction there is to commercial and non-commercial reuse of the data, the more the data will be used and improved.
Consider this: if I, in the course of my duties at a for-profit MPOW, find a file of records that I can do something useful with, I can get started doing that right away if I see a PDDL or CC0 license associated with it. If, instead, I see a no-commercial-use clause, I’ve hit a point of friction. I may choose to track down the contributor and negotiate a separate license, or, more likely, I’ll look for something else to work with. Unless you are the likes of the Library of Congress (i.e., your metadata can’t be ignored), using a non-commercial license when releasing your data simply means that it will be less likely to be used and improved. Worse, if a non-profit decides to aggregate PDDL/CC0 data and commercial-use-restricted data, it is even more difficult for commercial entities to touch the dataset at all — it’s one thing to track down one rights-holder, but dozens?
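To make the aggregation problem concrete, here is a minimal sketch in Python of the check a commercial reuser effectively has to run against a merged dataset. Everything in it is an assumption for illustration: the `Source` type, the license identifier strings, and the library names are hypothetical, and treating every license other than PDDL or CC0 as a point of friction is a deliberate simplification.

```python
from dataclasses import dataclass

# Licenses treated as friction-free in this sketch. PDDL and CC0 are the two
# named in the principles; the identifier strings themselves are assumptions.
PUBLIC_DOMAIN = {"PDDL", "CC0"}


@dataclass(frozen=True)
class Source:
    contributor: str  # hypothetical name of the library that released the file
    license: str      # hypothetical short license identifier


def holders_to_contact(sources):
    """Return the contributors a commercial reuser would have to negotiate with."""
    return sorted({s.contributor for s in sources if s.license not in PUBLIC_DOMAIN})


def friction_free(sources):
    """True only if every source in the aggregate is PDDL or CC0."""
    return not holders_to_contact(sources)


# A hypothetical aggregate mixing public-domain and non-commercial sources.
aggregate = [
    Source("Library A", "CC0"),
    Source("Library B", "PDDL"),
    Source("Library C", "CC-BY-NC"),
    Source("Library D", "CC-BY-NC"),
]

print(friction_free(aggregate))       # False
print(holders_to_contact(aggregate))  # ['Library C', 'Library D']
```

Two restricted sources out of four are enough to put the whole aggregate out of easy reach, and the list of rights-holders to track down grows with every restricted contributor added.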
Metadata is for use. It is also for continual editing and improvement, as metadata is always imperfect and incomplete. A library information ecosystem that promotes easy access to metadata and easy sharing might manage to keep up and stay relevant.
Of course, if numerous commercial entities make use of open bibliographic data but never compensate the libraries who paid to create it in the first place, that would over time become a strong disincentive for libraries to release open data. Therefore, I would like to suggest a fifth point — maybe not so much a principle as a recommendation — to commercial entities who make use of open bibliographic data: consider treating all open data, even data in the public domain, as if there were a mild copyleft license attached to it. In other words: give back. If in the course of providing your service you are not only using open data but improving the records, release your improvements as open bibliographic data. Moreover, invest some time in releasing your improvements in a maximally useful way — putting a file of improved data on your webserver is a good start, but if there are ways to contribute back to shared bibliographic databases or a hypothetical peer-to-peer metadata exchange so that the improvements can be more easily reused, consider doing so.