{"id":270,"date":"2013-03-10T22:56:22","date_gmt":"2013-03-11T05:56:22","guid":{"rendered":"http:\/\/galencharlton.com\/blog\/?p=270"},"modified":"2013-03-20T17:02:59","modified_gmt":"2013-03-21T00:02:59","slug":"bestiary-of-monstrous-data-1-a-trip-through-the-character-mangler","status":"publish","type":"post","link":"https:\/\/galencharlton.com\/blog\/2013\/03\/bestiary-of-monstrous-data-1-a-trip-through-the-character-mangler\/","title":{"rendered":"Bestiary of Monstrous Data #1: a trip through the character mangler"},"content":{"rendered":"<p><em>This is the first part in an occasional series on how good data can go bad.<\/em><img loading=\"lazy\" src=\"https:\/\/galencharlton.com\/blog\/wp-content\/uploads\/2013\/03\/bestiary_viper_thumbnail.jpg\" alt=\"bestiary_viper_thumbnail\" width=\"93\" height=\"100\" class=\"aligncenter size-full wp-image-297\" \/><\/p>\n<p>Consider the following snippets of a MARC21 record for the Spanish edition of the fourth Harry Potter book.<\/p>\n<pre class=\"crayon:false\">\r\n00998nam  2200313 c 4500\r\n...\r\n240 10 $a Harry Potter and the goblet of fire $l Espa\u00f1ol\r\n245 10 $a Harry Potter y el c\u00e1liz de fuego \/ $c J.K. Rowling ; [traducci\u00f3n, Adolfo Mu\u00f1oz Garc\u00eda y Nieves Mart\u00edn Azofra]\r\n<\/pre>\n<p>The original record uses the Unicode character set with the UTF-8 character encoding.  However, if you load this record into a modern ILS such as Koha or Evergreen, the title is likely to end up displayed as:<\/p>\n<pre class=\"crayon:false\">\r\nHarry Potter y el c\u00a9\u0141liz de fuego \/ J.K. Rowling ; [traducci\u00a9\u0111n, Adolfo Mu\u00a9\u0142oz Garc\u00a9\u01afa y Nieves Mart\u00a9\u01afn Azofra]\r\n<\/pre>\n<p>Too much copyright! This isn&#8217;t an electronic course reserves blog!<\/p>\n<p>What happened?  Look at the 9th position of the leader (counting from zero), and you&#8217;ll see that it is blank.  
In MARC21, blank means that the record uses the MARC-8 character set, while &#8216;a&#8217; means that it uses Unicode.  Many, if not most, modern MARC tools will go by the Leader\/09 to decide whether a character conversion is needed.  If that leader position is wrong, poor, defenseless diacritics will get mangled.<\/p>\n<p>Why are there so many copyright signs in the mistreated title?  As it happens, the UTF-8 representation of many common characters with Western European diacritics starts with byte 195 (or C3 in hexadecimal).  What does C3 mean in the MARC-8 character encoding?  You&#8217;ve guessed it: the copyright symbol.  The byte that follows is then read as a MARC-8 character of its own, which is why each \u00a9 is trailed by a stray letter such as \u0141 or \u0111.<\/p>\n<p>There are a couple of lessons to draw from this.  First, using a good character encoding isn&#8217;t enough; you must also say what you&#8217;re up to.  Second, if you look at enough bad data, you will start to recognize patterns on sight.  If you deal with a lot of data, that &#8220;second sight&#8221; is an arcane but useful skill to develop.<\/p>\n<p><small><em>CC-BY image of a <a href=\"http:\/\/www.flickr.com\/photos\/58558794@N07\/5436869194\/\">woodcut of a viper<\/a> courtesy of the <a href=\"http:\/\/www.flickr.com\/photos\/58558794@N07\/\">Penn Provenance Project<\/a>.<\/em><\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is the first part in an occasional series on how good data can go bad. 
Consider the following snippets of a MARC21 record for&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"Bestiary of Monstrous Data #1: a trip through the character mangler http:\/\/wp.me\/p3gJ9y-4m #code4lib #monstrousdata","jetpack_is_tweetstorm":false},"categories":[29,6],"tags":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3gJ9y-4m","_links":{"self":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/270"}],"collection":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/comments?post=270"}],"version-history":[{"count":39,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/270\/revisions"}],"predecessor-version":[{"id":418,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/posts\/270\/revisions\/418"}],"wp:attachment":[{"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/media?parent=270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/categories?post=270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/galencharlton.com\/blog\/wp-json\/wp\/v2\/tags?post=270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}