As part of my cemetery project I’m working with lots of fairly messy biographical data about individuals buried there. The site I’m working on, East End Cemetery: A Place and Its People, has information on over 300 individuals at the moment. Much of this is fairly detailed, with photographs, additional research, and narratives woven in. The East End Collaboratory is a complementary set of projects run through universities at the cemetery. The Collaboratory has a very rich interactive map with entries cross-referenced to FindAGrave (we opted not to do that directly) and mapped locations of markers. The Collaboratory dataset consists of over 1800 records.

From the beginning, I wanted to connect to records in the Collaboratory map. But there’s a good deal of variation in the way data on burials has been collected, depending on transcription methodology and source. I’ve been mulling this connection over for a while and finally had an opportunity to dig in.

I’m an OpenRefine devotee to the core, and it’s my favorite way to really dig in and clean messy data tables. Through some web searches, I stumbled upon reconcile-csv, designed for just this purpose. Already included in OpenRefine is functionality to reconcile datasets with Linked Open Data sources such as Wikidata. Reconcile-csv allows you to create your own reference dataset to match against, instead of using a source like Wikidata which is hosted elsewhere.

Reconcile-csv runs a service with Java. The first time I tried to connect the two datasets, the operation failed (just spinning and spinning). I could tell there was an error somewhere, but the messages were not helpful to me, with my lack of Java familiarity. I did some testing and got things to work with some VERY simple test datasets, so I stripped things down to their simplest parts and tried again. Lo and behold, matches were made.

I embarked upon this process never having used OpenRefine’s reconciliation tools with Wikidata or any of the other available options. The documentation on the reconcile-csv site is sparse. On my initial attempts, I just loaded both datasets straight in. I chose our own website dataset as the more authoritative one, using it to set the search field and ID in reconcile-csv, and used the much larger Collaboratory dataset as the match target. I could have set it up the other way around (with the Collaboratory data as the search data), but this seemed to work just fine.

Since the process failed on the first attempt, I removed all but the most necessary fields in the two datasets before attempting a match. [This was probably only necessary for the search dataset, but lesson learned.] I ran reconcile-csv and specified the search dataset, search column and ID column to return. Then I opened the Target dataset (the larger Collaboratory sheet) and created a “combined name” column that would be similar to the “Post Title” column from the website.
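I did this prep in OpenRefine itself, but the same two steps (dropping unneeded fields and building a combined name) can be sketched in a few lines of Python for anyone who prefers a script. This is only a rough illustration: the column names (`record_id`, `first_name`, `middle_name`, `last_name`, `notes`) and the sample rows are made up for the example and are not the real Collaboratory fields.

```python
import csv
import io


def trim_columns(rows, keep):
    """Keep only the listed columns from each row dict."""
    return [{k: row.get(k, "") for k in keep} for row in rows]


def add_combined_name(rows, parts, target="combined name"):
    """Join the non-empty name parts with single spaces into a new column."""
    out = []
    for row in rows:
        pieces = [row.get(p, "").strip() for p in parts]
        new_row = dict(row)
        new_row[target] = " ".join(p for p in pieces if p)
        out.append(new_row)
    return out


# Hypothetical sample standing in for an export of the larger dataset.
sample = io.StringIO(
    "record_id,first_name,middle_name,last_name,notes\n"
    "c-001,Mary,,Jones,broken marker\n"
    "c-002, John ,H.,Smith,\n"
)

rows = list(csv.DictReader(sample))
# Drop everything except the ID and the name parts.
rows = trim_columns(rows, ["record_id", "first_name", "middle_name", "last_name"])
# Build a single name string comparable to a "Post Title" style field.
rows = add_combined_name(rows, ["first_name", "middle_name", "last_name"])

for row in rows:
    print(row["record_id"], "->", row["combined name"])
```

Skipping empty parts matters here: a missing middle name would otherwise leave a double space in the combined name and drag down the match score for no good reason.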

I found a single YouTube tutorial on running reconcile-csv. It’s in French, so your mileage may vary, but I found it incredibly helpful for demonstrating how things were supposed to work.

A straight reconciliation produced 54 matches on the 280 search values. Not bad. When I filtered for matches with a score of 80% or higher, that caught almost everything with very few false positives. I did a lot of manual matching this time around, but now I know that I can just click that button in the future and then double check the automated matches.
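To make the idea of a score cutoff concrete, here is a small sketch of threshold-based fuzzy matching in Python. To be clear, this is not how reconcile-csv computes its scores: I’m using the standard library’s `difflib.SequenceMatcher` as a stand-in, and the names, the `similarity` and `best_matches` helpers, and the 0.8 cutoff are all just assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-1 similarity ratio, ignoring case and extra whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def best_matches(search_names, candidates, threshold=0.8):
    """Map each search name to (best candidate or None, best score)."""
    results = {}
    for name in search_names:
        scored = [(similarity(name, c), c) for c in candidates]
        score, match = max(scored, default=(0.0, None))
        # Only accept the best candidate if it clears the cutoff.
        results[name] = (match, score) if score >= threshold else (None, score)
    return results


# Hypothetical names; not real records from either dataset.
search = ["Mary Jones", "John H. Smith", "Zebulon Pike"]
targets = ["Mary Jones", "John Smith", "Ann Brown"]

results = best_matches(search, targets)
for name, (match, score) in results.items():
    print(f"{name!r} -> {match!r} (score {score:.2f})")
```

Anything that falls below the cutoff comes back with no match, which is the pile that still needs a human to look it over.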

Names being reconciled from my project

Once I had my workflow in hand, I was able to find very nearly 100% of the matches this way. I’m going to continue putting fuzzy matching to the test, working within this cemetery dataset to identify and link related records. I’m also concocting schemes for a big project at work with archaeological archives: fuzzy matching the (very non-standardized) references to gray literature in databases against our digital collection.

Overall, this is likely to be a game changer for me. I’d like to explore other methods of fuzzy matching in R and compare the two. More coming on this topic, for sure.
