Bestiary of Monstrous Data #1: a trip through the character mangler

This is the first part in an occasional series on how good data can go bad.

Consider the following snippets of a MARC21 record for the Spanish edition of the fourth Harry Potter book.

00998nam  2200313 c 4500
...
240 10 $a Harry Potter and the goblet of fire $l Español
245 10 $a Harry Potter y el cáliz de fuego / $c J.K. Rowling ; [traducción, Adolfo Muñoz García y Nieves Martín Azofra]

The original record uses the Unicode character set with the UTF-8 character encoding. However, if you load this record into a modern ILS such as Koha or Evergreen, the title is likely to end up displayed as:

Harry Potter y el c©Łliz de fuego / J.K. Rowling ; [traducci©đn, Adolfo Mu©łoz Garc©Ưa y Nieves Mart©Ưn Azofra]

Too much copyright! This isn’t an electronic course reserves blog!

What happened? Look at the 9th position of the leader (counting from zero), and you’ll see that it is blank. In MARC21, blank means that the record uses the MARC-8 character set, while ‘a’ means that it uses Unicode. Many, if not most, modern MARC tools will go by the Leader/09 to decide if a character conversion is needed. If the leader position is wrong, poor, defenseless diacritics will get mangled.
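To make the leader check concrete, here is a minimal Python sketch of how a tool might read Leader/09. The function name `declared_encoding` is made up for this example and does not come from any particular MARC library; the only facts it relies on are the ones above (blank means MARC-8, 'a' means Unicode) plus the fixed 24-character length of a MARC21 leader.

```python
def declared_encoding(leader: str) -> str:
    """Report which character coding scheme Leader/09 declares.

    Hypothetical helper for illustration; real MARC tools have their own APIs.
    """
    if len(leader) != 24:
        raise ValueError("a MARC21 leader is exactly 24 characters long")
    flag = leader[9]  # positions are counted from zero
    if flag == " ":
        return "MARC-8"
    if flag == "a":
        return "Unicode"
    return "undefined"


# The leader from the record above: position 9 is blank.
leader = "00998nam" + "  " + "2200313 c 4500"
print(declared_encoding(leader))

# The same leader with the flag set the way the data actually needs it.
fixed = leader[:9] + "a" + leader[10:]
print(declared_encoding(fixed))
```

A tool that trusts this flag will run a MARC-8 to UTF-8 conversion on the first leader and skip it on the second, which is exactly why a wrong flag does so much damage.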

Why are there so many copyright signs in the mistreated title? As it happens, the UTF-8 representation of many common characters with Western European diacritics starts with byte 195 (or C3 in hexadecimal). What does C3 mean in the MARC-8 character encoding? You’ve guessed it: the copyright symbol.
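The mangling can be reproduced in a few lines of Python. This is a rough sketch, not a real MARC-8 converter: `MARC8_EXCERPT` is a hand-typed table covering only the five bytes that show up in this title, and `misread_as_marc8` is a name invented for the example. A real converter also has to deal with combining characters and escape sequences, which are ignored here.

```python
import unicodedata

# Hand-typed excerpt of the MARC-8 table, limited to the bytes in this example.
MARC8_EXCERPT = {
    0xA1: "\u0141",  # Ł
    0xAD: "\u01AF",  # Ư
    0xB1: "\u0142",  # ł
    0xB3: "\u0111",  # đ
    0xC3: "\u00A9",  # © (the copyright sign)
}


def misread_as_marc8(text: str) -> str:
    """Encode text as UTF-8, then read each byte as if it were MARC-8."""
    # Normalize to precomposed characters so each accented letter is 2 bytes.
    data = unicodedata.normalize("NFC", text).encode("utf-8")
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # ASCII is the same in both encodings
        else:
            out.append(MARC8_EXCERPT.get(byte, "?"))  # "?" = not in the excerpt
    return "".join(out)


for ch in "\u00e1\u00f3\u00f1\u00ed":  # á ó ñ í
    hex_bytes = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(ch, hex_bytes, misread_as_marc8(ch))

print(misread_as_marc8("c\u00e1liz"))  # prints c©Łliz
```

Every one of the four accented letters encodes as C3 followed by a second byte, so every one of them comes out as a copyright sign followed by some unrelated letter.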

There are a couple of lessons to draw from this. First, using a good character encoding isn’t enough; you must also say what you’re up to. Second, if you look at enough bad data, you will start to recognize patterns on sight. If you deal with a lot of data, that “second sight” is an arcane but useful skill to develop.
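Some of that second sight can be handed to a script. The sketch below is one possible heuristic, not an established algorithm: if Leader/09 claims MARC-8 but the field data contains non-ASCII bytes that happen to decode cleanly as UTF-8, the record is probably mislabeled. The name `looks_mislabeled` is invented for this example, and the check can be fooled, since a genuine MARC-8 byte sequence could in principle also be valid UTF-8.

```python
def looks_mislabeled(leader: str, field_bytes: bytes) -> bool:
    """Guess whether a record flagged as MARC-8 actually holds UTF-8 data.

    Hypothetical heuristic for illustration only.
    """
    if leader[9] != " ":
        return False  # the record does not claim to be MARC-8
    if all(b < 0x80 for b in field_bytes):
        return False  # pure ASCII reads the same either way
    try:
        field_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so the MARC-8 label is plausible
    return True


leader = "00998nam" + "  " + "2200313 c 4500"

# UTF-8 bytes for "cáliz", as in the record above.
print(looks_mislabeled(leader, "c\u00e1liz".encode("utf-8")))

# MARC-8 writes the acute accent (byte E2) before the letter it modifies.
print(looks_mislabeled(leader, b"c\xe2aliz"))
```

The first call reports a likely mislabeled record and the second does not, because the MARC-8 byte sequence is not valid UTF-8.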

CC-BY image of a woodcut of a viper courtesy of the Penn Provenance Project.

CC BY-SA 4.0 Bestiary of Monstrous Data #1: a trip through the character mangler by Galen Charlton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.