Geoparsing the HESTIA text of Herodotus

I come to the GAP project from having worked on the Edinburgh Geoparser, a tool that comprises two main components: 1. The geotagger, which finds placename mentions in the text, and 2. The georesolver, which determines the geographical position of those places, where possible.

I’m responsible for processing the GAP texts with the Edinburgh Geoparser, starting with the HESTIA text of Herodotus. There are two initial tasks to confront:

First, I’m evaluating how well the geotagging is likely to work on GAP texts by using the HESTIA text of Herodotus as a test case: my assumption (which may be false!) is that the Herodotus text is reasonably representative of what we’ll be dealing with, in terms of places and peoples mentioned. It allows us to do a formal evaluation since it has been marked up in TEI and verified by hand. I’ve therefore written an evaluation routine that measures the precision (P), recall (R) and F-score (the harmonic mean of P and R) of the geotagger’s output against HESTIA’s Herodotus gold standard. (I strip the markup off the hdt_eng-p5-2.xml file, run it through the geotagger, then compare the output against the original.) So far the issue I’m hitting is that, whilst recall is good (81.47%), precision is poor (39.29%). From a brief study of the results, this seems to be largely due to the tagger misclassifying personal names as placenames. To counter the problem I plan to build a gazetteer list of personal names extracted from the Herodotus text, and amend the parser to use it. We’ll see whether that helps to achieve a more satisfactory precision rate.
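To make the three measures concrete, here is a minimal sketch of how such a comparison could be scored. It is not the actual evaluation routine: the function name `evaluate`, the representation of a mention as an (offset, string) pair, and the toy data are all assumptions made for illustration.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, f_score) for two collections of tagged mentions."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Precision: what fraction of the tagger's mentions are in the gold standard?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold-standard mentions did the tagger find?
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: each mention is a (character offset, surface string) pair.
gold = {(0, "Halicarnassus"), (40, "Lydia"), (95, "Sardis")}
predicted = {
    (0, "Halicarnassus"),
    (40, "Lydia"),
    (60, "Croesus"),     # personal name wrongly tagged as a place
    (120, "Candaules"),  # personal name wrongly tagged as a place
}

precision, recall, f_score = evaluate(gold, predicted)
print(f"P={precision:.2%} R={recall:.2%} F={f_score:.2%}")
```

In the toy data two personal names are tagged as places, so precision drops while recall is unaffected, which is the same pattern as in the real results. Plugging the figures reported above (P = 39.29%, R = 81.47%) into the same harmonic mean gives an F-score of roughly 53%.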

Second, I’m in the process of loading the latest version of the “Pleiades+” placename gazetteer (provided by Leif; see previous blog) into a MySQL database and amending the Geoparser to query it for the georesolving step. The output here will be text (still using the HESTIA Herodotus) with place names that can be plotted on a map. The plotting will only be approximate, however, because we’re using Pleiades’ Barrington Atlas grid-square centres for all cases except where Geonames gives a better fix.
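The "grid-square centre unless Geonames gives a better fix" rule can be sketched as a single lookup. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for MySQL, and the table name `gazetteer`, its columns, the `source` labels and the coordinates are all hypothetical placeholders rather than the real Pleiades+ schema or data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gazetteer (name TEXT, source TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?, ?)",
    [
        # Placeholder coordinates, not real gazetteer values.
        ("Sardis", "pleiades_grid", 38.5, 28.5),
        ("Sardis", "geonames", 38.48, 28.04),
        ("Gordium", "pleiades_grid", 39.5, 31.5),
    ],
)


def resolve(name):
    """Return (source, lat, lon) for a placename, or None if it is not listed.

    A Geonames row is preferred when one exists; otherwise the lookup
    falls back to the grid-square centre.
    """
    return conn.execute(
        "SELECT source, lat, lon FROM gazetteer WHERE name = ? "
        "ORDER BY CASE source WHEN 'geonames' THEN 0 ELSE 1 END "
        "LIMIT 1",
        (name,),
    ).fetchone()


print(resolve("Sardis"))    # Geonames row preferred
print(resolve("Gordium"))   # falls back to the grid-square centre
print(resolve("Atlantis"))  # not in the gazetteer
```

Keeping the preference in the query's ORDER BY means the calling code does not need to know which source supplied the coordinates, although the `source` value is returned so that approximate positions can be flagged on the map.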

Next plans
i) Once I’m happy with the precision rate, I’ll try altering the Geoparser so that it can take advantage (in step 2, the georesolver) of two distinct geographical gazetteers, in this case Pleiades+ and Geonames.
ii) Move on to other texts!

PS. I’ve highlighted the two-phase nature of the Geoparser as I want to make it clear that the system will only determine the spatial position of places it identifies in step 1, where it doesn’t have access to the gazetteer. This may be a bit of an issue for GAP texts, where we expect to have a high proportion of ancient place names and personal names, which the geotagging component may struggle to recognise. I’ll think about whether there’s anything I can do, other than what I’ve already suggested for the geotagging evaluation.
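The two-phase behaviour described in this PS can be shown with a toy pipeline. Everything here is a deliberately crude stand-in, not the Edinburgh Geoparser's method: the preposition-based `geotag` rule, the `PERSONAL_NAMES` list, the `GAZETTEER` dictionary and its coordinates are all hypothetical.

```python
import re

# Hypothetical personal-name list, of the kind proposed for improving precision.
PERSONAL_NAMES = {"Croesus", "Candaules"}

# Hypothetical gazetteer with placeholder coordinates.
GAZETTEER = {"Sardis": (38.5, 28.5), "Lydia": (38.5, 28.0)}


def geotag(text):
    """Step 1: find candidate placenames WITHOUT consulting the gazetteer.

    Toy rule: a capitalised word following a locative preposition,
    minus anything on the personal-name list.
    """
    candidates = re.findall(r"\b(?:from|to|in|at)\s+([A-Z][a-z]+)", text)
    return [name for name in candidates if name not in PERSONAL_NAMES]


def georesolve(names):
    """Step 2: look up only the names that step 1 handed over."""
    return {name: GAZETTEER[name] for name in names if name in GAZETTEER}


def geoparse(text):
    return georesolve(geotag(text))


text = "Messengers went from Sardis to Croesus. Lydia fell soon after."
print(geotag(text))
print(geoparse(text))
```

Here "Croesus" is filtered out by the personal-name list, while "Lydia" is never resolved even though the gazetteer contains it, because the tagging step did not pick it up. That is the limitation in miniature: the georesolver can only place what the geotagger has already found.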


About katefbyrne

Researcher in the Language Technology Group at Edinburgh University.
