Good metadata requires standardization and consistency. But historical documents and literature are notoriously messy; how can we create databases, digital editions, and data visualizations that rely on consistent data while maintaining the authenticity and spirit of the original dataset?
These are questions I’m grappling with while working on a database for the Cincinnati House of Refuge Project. I’m standardizing over 6,000 intake records from the 19th century, and I’m struggling to decide how to handle data that isn’t consistent.
I was going to suggest some Natural Language Processing approaches for this, but I noticed that another session is being offered on cleaning up and normalizing large datasets with open-source tools, so that might be a better fit for you.
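One pattern worth sketching, since it speaks directly to the authenticity question above: normalize a field into a new column for querying while leaving the original transcription untouched. The sketch below is a minimal Python example, not the project's actual workflow; the file name, the "occupation" column, and the variant spellings are hypothetical stand-ins, not taken from the House of Refuge records.

    import csv
    import re

    # Hypothetical lookup of variant spellings -> one controlled term.
    # Built by reviewing the messy column first; unknown values pass through unchanged.
    OCCUPATION_VARIANTS = {
        "pedler": "peddler",
        "peddlar": "peddler",
        "huckster": "peddler",
    }

    def normalize_occupation(raw):
        """Collapse whitespace, lower-case, and map known variants to a controlled term."""
        cleaned = re.sub(r"\s+", " ", raw).strip().lower()
        return OCCUPATION_VARIANTS.get(cleaned, cleaned)

    # Hypothetical file and column names; adjust to the real intake schema.
    with open("intake_records.csv", newline="", encoding="utf-8") as src, \
         open("intake_records_normalized.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames + ["occupation_normalized"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep the original transcription and add a normalized column beside it,
            # so the database stays consistent without overwriting the source text.
            row["occupation_normalized"] = normalize_occupation(row["occupation"])
            writer.writerow(row)

The same keep-both-columns approach extends to dates and place names, and the mapping table itself doubles as documentation of the editorial decisions behind the standardization.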
Session notes: twitter.com/hashtag/ethicsinmetadata?src=hash&vertical=default&f=tweets