A quick update on what I’ve been up to recently, and plans for the next couple of weeks:
- I’ve experimented with combining Pleiades+ with geonames (i.e. looking up toponyms in both) but, as expected, this floods the results with modern places, mostly in the Americas. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonames.html (and zoom out to see the whole world).
- I tried using the “bounding box” feature of the geoparser to indicate a strong preference for locations in Europe and North Africa, which removes many of the American locations. But if there is only one candidate it will be chosen, wherever it is, so this is still not satisfactory for ancient texts. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonamesbounded.html.
- What does seem to work well is Leif’s idea of using geonames as a source of alternative names to try against Pleiades+. For example, we now find a location for “Egypt” because “Aegyptus” is one of the geonames alternatives and is in Pleiades+. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeoaltsdisplay.html.
- The next thing is to try to improve the positioning we get from Pleiades+. So far we’re getting quite a few returns with no lat/lon position (which the geoparser plots on the equator, as zero/zero). Leif has had a look at the data and explained what’s going on. It looks as if we can fill in many of the blanks by a lookup through to geonames; that’s now on my to-do list.
- Once we believe we have the system working well, we will need to devise a method for formal evaluation against the Hestia gold standard data, to check whether we are finding the correct locations for toponyms where there are multiple candidates (such as for Salamis, which is the name of more than one place). We’ve discussed ways that this might be approached.
- In parallel to this work on the georesolution step, I’m working on improving the geotagging by adding gazetteers of personal names (such as Priam and Medea) to the process. At the moment “Priam”, for example, is recognised as a place name because there is a Priam in the USA and it’s not listed as a common personal name in the references the geoparser uses.
- I’ve already written scripts for formal evaluation of the geotagging step against the Hestia gold standard, i.e. to check whether we are identifying the same place names. This is more straightforward than the georesolution evaluation because we can tokenise both the marked-up data and the plain version that goes through the geoparser in the same way, and then compare the two data sets token by token to produce standard Precision/Recall/F-score measures. As noted in an earlier post, the recall is good but precision needs attention.
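
To make the alternative-names idea from the list above concrete, here is a minimal Python sketch. It is not the geoparser's actual code: the two dictionaries are tiny invented stand-ins for Pleiades+ and the geonames alternative-names data, the identifiers are placeholders, and the function names `normalise` and `resolve` are hypothetical. The only detail taken from the work itself is that “Egypt” is found via “Aegyptus”.

```python
# Toy stand-ins for the two gazetteers (assumption: real data would come
# from Pleiades+ and from the geonames alternative-names listing).
PLEIADES_PLUS = {
    "aegyptus": ["pleiades-example-1"],
    "salamis": ["pleiades-example-2", "pleiades-example-3"],
}
GEONAMES_ALT_NAMES = {
    "egypt": ["Égypte", "Aegyptus"],  # illustrative entries only
}


def normalise(name):
    """Lower-case and trim a toponym so lookups are case-insensitive."""
    return name.strip().lower()


def resolve(toponym):
    """Return (matched_name, candidate_ids).

    Try the toponym against Pleiades+ directly; if that fails, try each
    geonames alternative name against Pleiades+ in turn.
    """
    key = normalise(toponym)
    if key in PLEIADES_PLUS:
        return toponym, PLEIADES_PLUS[key]
    for alt in GEONAMES_ALT_NAMES.get(key, []):
        alt_key = normalise(alt)
        if alt_key in PLEIADES_PLUS:
            return alt, PLEIADES_PLUS[alt_key]
    return None, []


if __name__ == "__main__":
    print(resolve("Egypt"))
    print(resolve("Salamis"))
    print(resolve("Atlantis"))
```

Because geonames is only used to suggest extra names, every location returned still comes from Pleiades+, which is what keeps modern American places out of the results. A name like “Salamis” still returns more than one candidate, so the disambiguation question remains.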
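
The token-by-token comparison described in the last bullet can be sketched as follows. This is an illustration rather than the evaluation scripts themselves: it assumes both versions have already been tokenised identically and reduced to one boolean per token (True if the token is marked as part of a place name), and the function name `precision_recall_f` and the example sentence are invented.

```python
def precision_recall_f(gold, predicted):
    """Compute token-level precision, recall and F-score.

    gold and predicted are equal-length sequences of booleans, one per
    token, saying whether that token is tagged as a place name.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be tokenised identically")

    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if p and not g)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)

    tagged = true_pos + false_pos   # tokens the system tagged as places
    actual = true_pos + false_neg   # tokens the gold standard marks as places
    precision = true_pos / tagged if tagged else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Invented example: "Priam" is wrongly tagged as a place.
    tokens = ["Priam", "left", "Troy", "for", "Salamis"]
    gold = [False, False, True, False, True]
    predicted = [True, False, True, False, True]
    p, r, f = precision_recall_f(gold, predicted)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.67 R=1.00 F=0.80
```

In the invented example every real place name is found, so recall is perfect, but the spurious “Priam” drags precision down. That is the same pattern as the personal-name problem described above.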