I’ve been aware for a while that there was a mismatch between the resources used by the Geoparser for geotagging (finding toponyms in text) and georesolution (determining their lat/long position) – and I’ve now got around to dealing with that. We’ve been trying to use the Geoparser without too much tweaking and reprogramming, but I clearly needed to make the lexicons it uses for geotagging tie up with the Pleiades+ gazetteer it uses for georesolution.
For place names this is pretty straightforward, as the new lexicon is largely derived directly from Pleiades+. I also needed a lexicon of ancient personal names, as one of the main reasons for poor precision and recall scores on the geotagging seemed to be that there were too many confusions over personal names: there are several places (in the modern world) called Priam, for example.
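The kind of filtering involved can be sketched roughly as follows. This is a hypothetical illustration only, not the Geoparser’s actual code – the name sets are invented stand-ins for the Pleiades+-derived place lexicon and the ancient personal-name lexicon:

```python
# Hypothetical sketch: suppress place-name candidates that are better
# explained as ancient personal names (e.g. "Priam"). The sets below are
# tiny invented stand-ins for the real lexicons.
place_names = {"Athens", "Sardis", "Priam", "Babylon"}
personal_names = {"Priam", "Croesus", "Cyrus"}

def candidate_toponyms(tokens):
    """Return tokens found in the place lexicon that are not
    also listed as personal names."""
    return [t for t in tokens
            if t in place_names and t not in personal_names]

tokens = "Croesus marched from Sardis towards Babylon past Priam".split()
print(candidate_toponyms(tokens))  # → ['Sardis', 'Babylon']
```

In the real system the decision would of course be contextual rather than a blunt set difference, but the basic idea is the same: a personal-name lexicon lets the tagger discount spurious toponym matches.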
Dropping the modern place name lexicons altogether improves performance, and adding lists of ancient personal names has helped still further. The overall result is that, although there’s much tinkering we could still do, the geotagging is now producing pretty good results that are fit for our purposes. Compared against the gold standard of our hand-annotated Hestia data, the performance scores (using standard NLP precision, recall and F1 measure) are:
precision (percentage of our tags that are correct): 77.74%
recall (percentage of target we find): 95.58%
F1 (harmonic mean): 85.74%
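As a quick sanity check (not part of the Geoparser pipeline), the F1 figure follows directly from the precision and recall above:

```python
# Standard NLP F1 measure: the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

p, r = 0.7774, 0.9558
print(f"F1 = {f1(p, r):.2%}")  # → F1 = 85.74%
```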
There’s a simple display of Herodotus Book 1 text at http://synapse.inf.ed.ac.uk/~kate/gap/normname2.display.html. That display only highlights toponyms in the text, but one of the other things we’re playing around with is identifying personal names and temporal expressions. It may be that we can do interesting things with those in GAP, if we can identify them reliably.
The next thing I’m planning to do is to get back to processing actual Google Book texts. I’d interrupted myself on that in order to fix the problems with the geotagging performance.