This is the second part in an occasional series on how good data can go bad.

One aspect of the MARC standard that is sometimes forgotten is that it was meant to be a cataloging communications format. One could design an ILS that doesn’t use anything resembling MARC 21 to store or express bibliographic data, but as long as its internal structure is sufficiently expressive to keep track of the distinctions called for by AACR2, in principle it could relegate MARC handling strictly to import and export functionality. By doing so, it would follow a conception of MARC as a lingua franca for bibliographic software.

In practice, of course, MARC isn’t just a common language for machines; it’s also part of a common language for catalogers. If you say “222” or “245” or “780” to one, you’ve communicated a reasonably precise (in the context of AACR2) identification of a metadata attribute. Sure, it’s arcane, but then again so is most professional jargon to non-practitioners. MARC has also become the basis of record storage and editing in most ILSs, to the point where the act of cataloging is sometimes conflated with the act of creating and editing MARC records.

But MARC’s origins as a communications format can sometimes conflict with its ad hoc role as a storage format.  Consider this record:

00528dam  22001577u 4500
001 123
100 1  $a Strang, Elizabeth Leonard.
245 10 $a Lectures on landscape and gardening design / $c by Elizabeth Leonard Strang.

A brief bibliographic record, right? Look at the Leader/05, which stores the record status. The value ‘d’ means that the record is deleted; other values for that position include ‘n’ for new and ‘c’ for corrected.
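To make the position concrete, here is a minimal Python sketch that reads the record status out of a leader string. The function name and the lookup table are my own illustrations, and the table only covers the three values mentioned above.

```python
# Illustrative only: covers just the three Leader/05 values named in this post.
RECORD_STATUS = {"n": "new", "c": "corrected", "d": "deleted"}


def record_status(leader: str) -> str:
    """Return a human-readable record status from Leader/05 (zero-based)."""
    code = leader[5]
    return RECORD_STATUS.get(code, f"other ({code!r})")


# The leader from the sample record above.
leader = "00528dam  22001577u 4500"
status = record_status(leader)
print(status)
```

Nothing in this check knows whose catalog the record came from, which is exactly the problem with treating the flag as a fact about the book.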

But unlike, say, the 245, the Leader/05 isn’t making an assertion about a bibliographic entity.  It’s making an assertion about the metadata record itself, and one that requires more context to make sense.  There can’t be a globally valid assertion that a record is deleted; my public library may have deaccessioned Lectures on landscape and gardening design, but your horticultural library may keep that title indefinitely.

Consequently, the Leader/05 is often ignored when creating or modifying records in an ILS.  For example, if a bib record is present in an Evergreen or Koha database, setting its Leader/05 to ‘d’ does not affect its indexing or display.

However, such records can become undead, not in the context of the ILS, but in the context of exporting them for loading into a discovery layer or a union catalog. Some discovery layers do look at the Leader/05. If an incoming record is marked as deleted, that is taken as a signal to remove the matching record from the discovery layer’s own indexes. If there is no matching record, the discovery layer would be reasonable to ignore an incoming “deleted” record, and I know of at least one that does exactly that.
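The behavior described above can be sketched in a few lines of Python. This is a hypothetical loader, not the code of any particular discovery layer: the `apply_incoming` function, the dict used as an index, and the return strings are all assumptions made for illustration.

```python
def apply_incoming(index: dict, record_id: str, leader: str, title: str) -> str:
    """Hypothetical discovery-layer loader that trusts Leader/05."""
    if leader[5] == "d":
        if record_id in index:
            # A "deleted" record is read as a request to drop the match.
            del index[record_id]
            return "removed"
        # No match to remove, so the incoming record is silently skipped.
        return "ignored"
    index[record_id] = title
    return "indexed"


# First export of the sample record into an empty discovery layer.
index = {}
result = apply_incoming(
    index,
    "123",
    "00528dam  22001577u 4500",
    "Lectures on landscape and gardening design",
)
print(result, index)
```

Each branch is defensible on its own; it is only the combination with an ILS that ignores the flag that causes trouble.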

The result? A record that appears to be perfectly good in the ILS doesn’t show up in the discovery layer.

Context matters.

I’ll finish with a couple of SQL queries for finding such undead records, one for Evergreen:

SELECT record
FROM metabib.full_rec mfr
JOIN biblio.record_entry bre ON (bre.id = mfr.record)
WHERE tag = 'LDR'
AND SUBSTRING(value, 6, 1) = 'd'
AND NOT bre.deleted;

and one for Koha:

SELECT biblionumber
FROM biblioitems 
WHERE ExtractValue(marcxml, 'substring(//leader, 6, 1)') = 'd';

 

CC-BY image of a woodcut of a viper courtesy of the Penn Provenance Project.

This is the first part in an occasional series on how good data can go bad.

Consider the following snippets of a MARC21 record for the Spanish edition of the fourth Harry Potter book.

00998nam  2200313 c 4500
...
240 10 $a Harry Potter and the goblet of fire $l Español
245 10 $a Harry Potter y el cáliz de fuego / $c J.K. Rowling ; [traducción, Adolfo Muñoz García y Nieves Martín Azofra]

The original record uses the Unicode character set with the UTF-8 character encoding. However, if you load this record into a modern ILS, e.g. Koha or Evergreen, the title is likely to end up displayed as:

Harry Potter y el c©Łliz de fuego / J.K. Rowling ; [traducci©đn, Adolfo Mu©łoz Garc©Ưa y Nieves Mart©Ưn Azofra]

Too much copyright! This isn’t an electronic course reserves blog!

What happened? Look at the 9th position of the leader (counting from zero), and you’ll see that it is blank. In MARC21, blank means that the record uses the MARC-8 character set, while ‘a’ means that it uses Unicode. Many, if not most, modern MARC tools will go by the Leader/09 to decide if a character conversion is needed. If the leader position is wrong, poor, defenseless diacritics will get mangled.
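As a rough sketch of the decision a MARC tool makes, the following Python reads Leader/09 and reports which encoding the record claims to use. The `declared_encoding` function and its return strings are my own illustrations rather than the API of any real MARC library.

```python
def declared_encoding(leader: str) -> str:
    """Report the character coding scheme a MARC21 leader declares."""
    flag = leader[9]  # Leader/09, counting from zero
    if flag == "a":
        return "utf-8"
    if flag == " ":
        return "marc-8"
    raise ValueError(f"unexpected Leader/09 value: {flag!r}")


# The leader from the Harry Potter record above: Leader/09 is blank.
leader = "00998nam  2200313 c 4500"
claimed = declared_encoding(leader)

# The same leader with Leader/09 set to 'a', as it should be for UTF-8 data.
fixed_leader = leader[:9] + "a" + leader[10:]
corrected = declared_encoding(fixed_leader)
print(claimed, corrected)
```

A tool that follows this logic will run a MARC-8 to UTF-8 conversion on the record as shipped, even though the bytes are already UTF-8.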

Why are there so many copyright signs in the mistreated title? As it happens, the UTF-8 representation of many common characters with Western European diacritics starts with byte 195 (or C3 in hexadecimal). What does C3 mean in the MARC-8 character encoding? You’ve guessed it: the copyright symbol.
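You can reproduce the damage with a short Python sketch. The lookup table below is deliberately partial: it holds only the five MARC-8 byte values needed for this title, and `misread_as_marc8` is a hypothetical helper of mine, not a real MARC-8 decoder (a real one also has to handle combining diacritics and alternate character sets).

```python
# Partial MARC-8 lookup: only the bytes that occur in this example.
MARC8_SAMPLE = {
    0xC3: "©",
    0xA1: "Ł",
    0xB3: "đ",
    0xB1: "ł",
    0xAD: "Ư",
}


def misread_as_marc8(text: str) -> str:
    """Encode text as UTF-8, then read each byte as if it were MARC-8."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))  # ASCII range is passed through unchanged
        else:
            out.append(MARC8_SAMPLE.get(byte, "?"))  # "?" marks bytes outside the sample
    return "".join(out)


# Every one of these characters starts with byte C3 in UTF-8.
lead_bytes = {ch: ch.encode("utf-8")[0] for ch in "áóñí"}
mangled = misread_as_marc8("cáliz")
print(lead_bytes, mangled)
```

Each accented letter becomes two characters: the copyright sign from the shared C3 lead byte, followed by whatever the second byte happens to mean in MARC-8.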

There are a couple of lessons to draw from this. First, using a good character encoding isn’t enough; you must also say what you’re up to. Second, if you look at enough bad data, you will start to recognize patterns on sight. If you deal with a lot of data, that “second sight” is an arcane but useful skill to develop.
