We haven’t posted in a while. Things have been happening… This is just a brief update on some recent work I’ve been doing with the geoparser. If I don’t make it brief it won’t happen at all.
Way back in the summer we had a meetup in New York (roughly the geographical centroid of our team!) and one of the things I was working on then was experimenting with broadening the range of texts we could process. We started out of course with ancient places, and we know the geoparser is happy enough with modern places, so we thought we’d have a go at places in fiction. I had some fun with Around the World in 80 Days and A Tale of Two Cities, for example.
However, one of the issues we’ve been needing to sort out for a while is getting the geoparser properly available on the web so that all this processing of texts doesn’t have to happen on my local machine. That will open the door to the collaborations we want to foster, e.g. with Project Bamboo, because people will be able to call the geoparser remotely, as an element in their own processing pipelines.
I’m happy to say that this has now become a possibility, thanks to work by Edina and the Language Technology Group at Edinburgh. Edina maintain the Unlock Text API, which is the web face of the Edinburgh Geoparser. It has a REST interface that allows you to POST texts using your local URLs and GET the geoparsed results in XML or JSON format. The documentation page on the site explains how.
There are a couple of important points that the documentation doesn’t yet mention. The first is that, as well as getting a list of placenames found in your text as output, you can also get the full text back with the placenames annotated as standoff XML. This is in the ‘.lem.xml’ and ‘.lem.json’ output files. In fact you get a whole lot more stuff besides placenames, as we also pick out other entities and relations between them – see the LTG page mentioned above for information about the full range of pipeline outputs. But if you just want placenames, look for the <ent> elements with the attribute type="location". These elements point back to the relevant string in the text and include the latitude and longitude and a reference to the source gazetteer entry.
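As a rough illustration of picking out just the placenames, here is a minimal Python sketch that filters a ‘.lem.xml’ file for <ent> elements with type="location". Only the element name and the type attribute come from the description above. The sample document and its other attribute names (id, lat, long, gazref) are made up for the example, and the sketch assumes the elements are not in an XML namespace, so check a real output file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only <ent> and type="location" are taken from the
# post; the wrapper elements and the other attribute names are invented.
SAMPLE = """<document>
  <ents>
    <ent id="e1" type="location" lat="40.7" long="-74.0" gazref="example:1"/>
    <ent id="e2" type="person"/>
  </ents>
</document>"""


def location_entities(xml_text):
    """Return the attributes of every <ent> element with type="location"."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of <ent> does not matter.
    return [
        dict(ent.attrib)
        for ent in root.iter("ent")
        if ent.get("type") == "location"
    ]
```

Calling location_entities(SAMPLE) gives one dictionary per location entity, from which you can read whatever coordinate and gazetteer attributes your real output carries.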
The second point to note is that you can specify which gazetteer you want to reference your text against. If you don’t specify one then it uses all the ones it knows about, currently ‘unlock’, ‘os’, ‘geonames’, ‘naturalearth’ and ‘plplus’. This can produce interesting information but often you’ll want to use a particular one. In the case of classical texts that will be Pleiades+ (‘plplus’), which references the Pleiades gazetteer as we’ve explained in earlier blog posts. The gazetteer can be specified in the JSON data item, following the name of the input file, as a ‘gazetteer’:’gazname’ pair.
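For anyone scripting this, the JSON data item could be put together along these lines. This is a sketch under assumptions: the ‘gazetteer’ key and the gazetteer names come from the paragraph above, but the key that carries the input file (called "src" here) is a placeholder I have chosen for illustration, so take the real name from the Unlock Text documentation.

```python
import json

# Gazetteer names listed in the post.
KNOWN_GAZETTEERS = ("unlock", "os", "geonames", "naturalearth", "plplus")


def build_data_item(src_url, gazetteer=None, src_key="src"):
    """Build the JSON data item: the input file first, then the gazetteer.

    src_key is a hypothetical key name, not taken from the API documentation.
    """
    data = {src_key: src_url}
    if gazetteer is not None:
        if gazetteer not in KNOWN_GAZETTEERS:
            raise ValueError("unknown gazetteer: %r" % (gazetteer,))
        # Insertion order is preserved, so this pair follows the input file.
        data["gazetteer"] = gazetteer
    return json.dumps(data)
```

For a classical text you would call something like build_data_item("http://example.org/mytext.txt", "plplus"); leaving the gazetteer out omits the pair altogether, which corresponds to the default of using every gazetteer.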
As for input formats, the Unlock Text interface will take plain text or HTML (though it may struggle with very complex HTML) and it will do its best with PDF.
This is all still a bit of a work in progress, but we are making progress. 🙂