Data cleanup as a force for evil

A quotidian concern of anybody responsible for a database is the messy data it contains. See a record about a Pedro GonzÃ¡lez? Bah, the assumption of Latin-1 strikes again! Better correct it to González. Looking at his record in the first place because you’re reading his obituary? Oh dear, better mark him as deceased. 12,741 people living in the bungalow at 123 Main St.? Let us now ponder the wisdom of the null and the foolishness of the dummy value.
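For the curious, here is a minimal Python sketch of two of those cleanups: undoing a Latin-1 mix-up, and seeing why a null beats a dummy address. The function names and the sample records are invented for illustration, and the repair only handles the simple case where UTF-8 bytes were decoded as Latin-1 (real-world mojibake is often messier).

```python
from collections import Counter


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the simple Latin-1 pattern (or already correct): leave it alone.
        return text


def household_sizes(records):
    """Count people per address, skipping records whose address is unknown (None)."""
    return Counter(r["address"] for r in records if r["address"] is not None)


# Hypothetical records: one real resident, two people whose address is unknown.
with_dummy = [
    {"name": "Resident", "address": "123 Main St."},
    {"name": "Unknown 1", "address": "123 Main St."},  # dummy value
    {"name": "Unknown 2", "address": "123 Main St."},  # dummy value
]
with_null = [
    {"name": "Resident", "address": "123 Main St."},
    {"name": "Unknown 1", "address": None},
    {"name": "Unknown 2", "address": None},
]

print(repair_mojibake("GonzÃ¡lez"))
print(household_sizes(with_dummy)["123 Main St."])
print(household_sizes(with_null)["123 Main St."])
```

With the dummy value, the unknowns are indistinguishable from the bungalow's actual resident; with the null, the database at least admits what it does not know.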

Without squinting too hard, one could view library name authority control as a grand collaborative data cleanup project.

What of the morality of data cleanup? Let’s assume that the data should be gathered in the first place; then as Patricia Hayes noted back in 2004, there is of course an ethical expectation that efforts such as medical research will be based on clean data: data that has been carefully collected under systematic supervision.

Let’s consider another context: whether to engage in batch authority cleanup of a library catalog. The decision of whether it is worth the cost, like most decisions on allocating resources, has an ethical dimension: does the improvement in the usefulness of the catalog outweigh the benefits of other potential uses of the money? Sometimes yes, sometimes no, and the decision often depends on local factors, but generally there’s not much examination of the ethics of the data cleanup per se. After all, if you should have the database in the first place, it should be as accurate and precise as you can manage consistent with its raison d’être.

Now let’s consider a particular sort of database. One full of records about people. Specifically, a voter registration database.  There are many like it; after all, at its heart it’s just a slightly overgrown list of names and addresses.

An overgrown list of names and addresses around which much mischief has been done in the name of accuracy.

This is on my mind because the state I live in, Georgia, is conducting a gubernatorial election that just about doubles as a referendum on how to properly maintain a voter registration list.

On the one hand, you have Brian Kemp, the current Georgia secretary of state, whose portfolio includes the office that maintains the statewide voter database and oversees all elections. On the other hand, Stacey Abrams, who among other things founded the New Georgia Project aimed at registering tens of thousands of new voters, albeit with mixed results.

Is it odd for a candidate for governor to oversee the department that would certify the winner of the governor’s race? The NAACP and others think so, having filed a lawsuit to try to force Kemp to step down as secretary of state. Moreover, Kemp has a history of efforts to “clean” the voter rolls, efforts that tend to depress votes by minorities, in a state that is becoming increasingly purple. (And consider the county I live in, Gwinnett County. It is the most demographically diverse county in the southeast… and happens to have the highest rate of rejection of absentee ballots so far this year.) Most recently, the journalist Greg Palast published a database of voters purged from Georgia’s list. This database contains 591,000 names removed from the rolls in 2017… one tenth of the list!

A heck of a data cleanup project, eh?

Every record removal that prevents a voter from casting their ballot on election day is an injustice. Every one of the 53,000 voters whose registration is left pending due to the exact match law is suffering an injustice. Hopefully they won’t be put off and will vote… if they can produce ID… if the local registrar’s discretion leans towards expanding and not collapsing the franchise.

Dare I say it? Data cleanup is not an inherently neutral endeavor.

Sure, much of the time data cleanup work is just improving the accuracy of a database—but not always. If you work with data about people, be wary.

Data cleanup as a force for evil by Galen Charlton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.