Some progress on geo-resolution

A quick update on what I’ve been up to recently, and plans for the next couple of weeks:

  • I’ve experimented with combining Pleiades+ with GeoNames (i.e. looking up toponyms in both) but, as expected, this floods the results with too many modern places, mostly in the Americas. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonames.html (zoom out to see the whole world).
  • I tried using the “bounding box” feature of the geoparser to indicate a strong preference for locations in Europe and North Africa, which removes many of the American locations. But if there is only one candidate it will be chosen, wherever it is, so this is still not satisfactory for ancient texts. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonamesbounded.html.
  • What does seem to work well is Leif’s idea of using geonames as a source of alternative names to try against Pleiades+. For example, we now find a location for “Egypt” because “Aegyptus” is one of the geonames alternatives and is in Pleiades+. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeoaltsdisplay.html.
  • The next thing is to try to improve the positioning we get from Pleiades+. So far we’re getting quite a few returns with no lat/lon position (which the geoparser plots on the equator, as zero/zero). Leif has had a look at the data and explained what’s going on. It looks as if we can fill in many of the blanks by a lookup through to geonames; that’s now on my to-do list.
  • Once we believe we have the system working well, we will need to devise a method for formal evaluation against the Hestia gold standard data, to check whether we are finding the correct locations for toponyms where there are multiple candidates (such as for Salamis, which is the name of more than one place). We’ve discussed ways that this might be approached.
  • In parallel to this work on the georesolution step, I’m working on improving the geotagging by adding gazetteers of personal names (like Priam, Medea, etc.) to the process. At the moment “Priam”, for example, is recognised as a place name because there is a Priam in the USA, and it’s not listed as a common personal name in the references the geoparser uses.
  • I’ve already written scripts for formal evaluation of the geotagging step against the Hestia gold standard, i.e. to check whether we are identifying the same place names. This is more straightforward than the georesolution evaluation because we can tokenise the marked-up data and the plain version that goes through the geoparser in the same way, and then compare the two data sets token by token to produce standard Precision/Recall/F-score measures. As noted in an earlier post, the recall is good but the precision is in need of attention.
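The token-by-token comparison behind those Precision/Recall/F-score figures can be sketched roughly as follows. This is a simplified illustration, not the actual evaluation scripts: it assumes both token streams are already aligned, with each token labelled True (place name) or False (not a place name).

```python
# Sketch of token-level evaluation of the geotagger against a gold standard.
# Both sequences must be tokenised identically so labels align position by position.

def prf(gold, predicted):
    """Precision, Recall and F-score over aligned boolean token labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    return precision, recall, fscore

# Toy example: five tokens; the gold standard marks the 2nd and 4th as places,
# while the tagger also (wrongly) tags the 3rd -- a false positive that hurts precision.
gold      = [False, True, False, True, False]
predicted = [False, True, True,  True, False]
precision, recall, fscore = prf(gold, predicted)
```

In this toy case recall is perfect but precision suffers from the one spurious tag, which mirrors the pattern we see with misrecognised personal names.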
Posted in Uncategorized | 2 Comments

Visualising some sample results using Pleiades+

Now that the new extended version of the Pleiades name-set based on GeoNames (aka Pleiades+) is available, I’ve altered the Geoparser to use it as a gazetteer in the georesolving step for working out the geographical location of places mentioned in the text. I’ve posted some sample results, for Book 1 of the HESTIA Herodotus, at http://synapse.inf.ed.ac.uk/~kate/gap/plplusdisplay.html. This shows the place-names found in the geotagging step and the location that was ranked first by the georesolver, if there were one or more matches in Pleiades+.

As this sample shows, there are some erroneous “places” (like “Priam”) and some valid places for which no location was found (like “Egypt”). The first issue is to do with improving precision in the geotagging step, as discussed in my last post. The second issue arises because Pleiades+ does not include modern place-names: Pleiades obviously has Egypt in its dataset, but it resides under the label “Aegyptus”. (Pleiades also prefers the Latinised forms of places to the Greek.)

We are currently trying to work out the best way of dealing with the missing place-names problem. One option is to use GeoNames as a default if no match can be found in Pleiades+: but this solution brings with it the danger of swamping the user with contemporary place-names of the kind that we wouldn’t expect to find in ancient texts. A neat idea suggested by Leif is to look up missing places like Egypt in GeoNames, and then try to match the alternative names listed there against Pleiades+. As it happens this would work fine in the case of Egypt – because “Aegyptus” is indeed one of the alternative names – but we wouldn’t find the match in the first place because it’s only listed as an alternative for “Arab Republic of Egypt”, not for “Egypt” itself. Over the next week or so I’m going to investigate whether this option is feasible within the geoparser’s architecture. (We may find there are simply too many alternative names to handle: in general every place name has multiple candidates in GeoNames, each of which in turn has multiple alternative forms.)
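Leif’s alternative-names fallback can be sketched like this. It is a toy illustration only: the dictionaries, the identifier strings and the coordinates are invented stand-ins for the real Pleiades+ and GeoNames lookups.

```python
# Sketch of the fallback strategy: if a toponym is missing from Pleiades+,
# fetch its GeoNames alternative names and retry each of those instead.

def resolve(toponym, pleiades_plus, geonames_alternatives):
    """Return a Pleiades+ entry for the toponym, trying GeoNames
    alternative names when the direct lookup misses."""
    if toponym in pleiades_plus:
        return pleiades_plus[toponym]
    for alt in geonames_alternatives.get(toponym, []):
        if alt in pleiades_plus:
            return pleiades_plus[alt]
    return None

# Toy data modelling the Egypt/Aegyptus case (ids and coordinates invented).
pleiades_plus = {"Aegyptus": ("pleiades-id-766", 27.0, 30.0)}
geonames_alternatives = {"Egypt": ["Aegyptus", "Misr"]}
```

Here “Egypt” misses on the direct lookup but succeeds via its alternative name “Aegyptus”; a toponym with no usable alternatives returns None.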

Another intriguing idea has been proposed by Prof. Bruce Robertson of Mount Allison University (Canada), who acts as a Technical Observer for the Pleiades Project: he wonders whether we could effectively use the Latin Wikipedia, especially given that Pleiades has a penchant for Latin names – and, indeed, in our case this resource would find Egypt under the string “Aegyptus”. Additionally, at the top of each page the relevant entity has its Lat/Long given, which at the very least could be used as a sanity check!

In the end, however, it may turn out that we need an enhanced version of Pleiades+ – a Pleiades++ as it were – which would contain the kinds of names that we expect to occur.


A DIALOG between projects: Bridging the GAP to ancient world data

HESTIA has started to use the latest digital technology for the interrogation of geographical concepts mentioned in an ancient historical narrative; GAP builds on this research by pioneering the means to discover ancient places not only in a single text like Herodotus’ Histories but using the entire corpus of GoogleBooks; DIALOG goes one step further still by starting to bring together ancient world research so that different kinds of data related to any given ancient location can be discovered, queried and visualised.

DIALOG (Document and Integrate Ancient Linked Open Geodata) is being funded by JISC (strand 15/10, Infrastructure for Education and Research: Geospatial) and will run from 1 February to 31 October 2011. Employing Linked Open Data (LOD) principles to connect textual, visual and tabular documents that reference places in Ancient World research, it has three primary aims:
i) To define a Core Ontology for Place References (COPR)
ii) To document the process of assimilating place references and publish as Resource Description Framework (RDF)
iii) To develop neo-geographic Web services and tools that make the published RDF easily consumable by learners, educators, researchers and the public.

We believe that, by using LOD principles to connect geo-situated textual, visual and tabular documents (hence LOG: Linked Open Geodata), DIALOG will dramatically empower learners, teachers and researchers in seeking to find and use geospatial data and services.

Led by HESTIA and GAP’s Elton Barker (Classical Studies, The Open University) and Leif Isaksen (Archaeological Computing Research Group, Southampton), in collaboration with the JISC-funded project LUCERO (The Open University), DIALOG brings together an experienced, international and interdisciplinary consortium of pre-established teams that use geospatial information technologies for Ancient World research. They are (with datasets in parentheses):

Perseus, Tufts (XML-encoded free-text)
GAP (narrative free texts)
Supporting Productive Queries for Research, KCL, London (fragmentary free-texts)
Arachne, Cologne (database records of material finds)
Digital Memory Engineering, Austrian Institute of Technology (rasterized maps)

DIALOG partners will exchange data, practices and experience with each other as they align their place referencing to the Uniform Resource Identifiers (URIs) produced by the Pleiades gazetteer of ancient places. In this way, when one project points to a particular ancient location, it will be possible for the user to find out what other datasets also refer to that place, and bring that information to bear on their analysis of it.

We are immensely grateful for all their support (and that of others) in putting this successful proposal together, and we look forward to working with them!

More anon


Geoparsing the HESTIA text of Herodotus

I come to the GAP project having worked on the Edinburgh Geoparser (see e.g. www.inf.ed.ac.uk/publications/online/1360.pdf), a tool that comprises two main components: 1. the geotagger, which finds placename mentions in the text; and 2. the georesolver, which determines the geographical position of those places, where possible.

I’m responsible for processing the GAP texts with the Edinburgh Geoparser, starting with the HESTIA text of Herodotus. There are two initial tasks to confront:

First, I’m evaluating how well the geotagging is likely to work on GAP texts by using the HESTIA text of Herodotus as a test case: my assumption (which may be false!) is that the Herodotus text is reasonably representative of what we’ll be dealing with, in terms of places and peoples mentioned. It allows us to do a formal evaluation since it has been marked up in TEI and verified by hand. I’ve therefore written an evaluation routine that checks the recall, precision and F-score (the harmonic mean of P and R) of the geotagger’s output, compared with HESTIA’s Herodotus gold standard. (I strip the markup off the hdt_eng-p5-2.xml file, run it through the geotagger, then compare the output against the original.) So far the main issue I’m hitting is that, whilst recall is good (81.47%), precision is poor (39.29%). From a brief study of the results, this seems to be largely due to the tagger misclassifying personal names as placenames: to counter the problem I plan to build a gazetteer list of personal names extracted from the Herodotus text, and amend the parser to use it. We’ll see whether that helps to achieve a more satisfactory precision rate.
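The planned personal-name filter might look something like the following sketch. The names and the function are illustrative only, not the actual amendment to the parser.

```python
# Sketch: suppress candidate place-names that appear in a gazetteer of
# personal names extracted from the text, so that e.g. "Priam" is no
# longer mis-tagged as a place.

# Illustrative personal-name gazetteer (in practice, extracted from Herodotus).
personal_names = {"Priam", "Medea", "Croesus"}

def filter_places(tagged_places, personal_names):
    """Drop candidate place-names found in the personal-name gazetteer."""
    return [p for p in tagged_places if p not in personal_names]

# Toy geotagger output: two genuine places and two misclassified persons.
tagged = ["Sardis", "Priam", "Halicarnassus", "Medea"]
filtered = filter_places(tagged, personal_names)
```

Filtering this way should raise precision (fewer false positives) without touching recall, so long as no genuine place shares a name with a listed person.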

Second, I’m in the process of loading the latest version of the “Pleiades+” placename gazetteer (provided by Leif – see previous blog) into a MySQL database and amending the Geoparser to query it for the georesolving step. The output here will be text (still using the HESTIA Herodotus) with place names that can be plotted on a map – only approximately, however, because we’re using Pleiades’ Barrington Atlas grid-square centres in all cases except where GeoNames gives a better fix.
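A rough sketch of that gazetteer lookup, using Python’s built-in sqlite3 in place of the MySQL database; the table layout and column names are assumptions for illustration, not the project’s actual schema, and the coordinates are toy values.

```python
# Sketch: store Pleiades+ toponyms in a relational table and fetch all
# candidate locations for a normalized toponym during georesolution.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pleiades_plus (
    pleiades_id TEXT, toponym TEXT, lat REAL, lon REAL)""")
conn.executemany(
    "INSERT INTO pleiades_plus VALUES (?, ?, ?, ?)",
    [("p1", "salamis", 37.9, 23.5),   # island near Athens (toy coordinates)
     ("p2", "salamis", 35.2, 33.9)])  # city on Cyprus (toy coordinates)

def candidates(toponym):
    """Return all gazetteer candidates for a toponym (normalized to lower case)."""
    cur = conn.execute(
        "SELECT pleiades_id, lat, lon FROM pleiades_plus WHERE toponym = ?",
        (toponym.lower(),))
    return cur.fetchall()
```

Ambiguous names like Salamis return multiple rows, which is exactly the situation the georesolver’s ranking then has to settle.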

Next plans
i) Once I’m happy with the precision rate, I’ll try altering the Geoparser so that it can take advantage (in the georesolving step) of two distinct geographical gazetteers, in this case Pleiades+ and GeoNames.
ii) Move on to other texts!

PS. I’ve highlighted the two-phase nature of the Geoparser because I want to make it clear that the system will only determine the spatial position of places it identifies in step 1, and that step doesn’t have access to the gazetteer. This may be a bit of an issue for GAP texts, where we expect a high proportion of ancient place names and personal names, which the geotagging component may struggle to recognise. I’ll think about whether there’s anything I can do beyond what I’ve already suggested for the geotagging evaluation.


Pleiades+ : adapting the ancient world gazetteer for GAP

Pleiades (http://pleiades.stoa.org/), a project that is digitizing the Barrington Atlas of the Greek and Roman World (R.J.A. Talbert, ed., Princeton, 2000), is in the process of putting on line the most extensive and accurate coverage of ancient locations published thus far. As such it will provide the basic gazetteer for GAP’s identification of ancient places in the Google Books corpus. Yet, in its present form there are two significant drawbacks to using Pleiades that GAP initially must overcome.

The first is that Pleiades does not currently provide specific coordinates, only the grid square of each location in the Barrington Atlas. This has implications for the plotting of locations and, consequently, for the clustering mechanics upon which GAP’s place-resolving algorithm is based. The good news, however, is that Pleiades is working in conjunction with the Digital Atlas of Roman and Medieval Civilization (DARMC) to provide these coordinates (see: http://pleiades.stoa.org/Members/sgillies/news-items/first-coordinates-from-darmc). For the time being, then, we will have to make do with grid-square centroids, which broadly suffice for the GAP algorithm – but we keep our fingers crossed that very soon we’ll be able to draw upon the specific coordinates of each ancient location mapped in Pleiades.

The second issue is that Pleiades has only limited support for multiple toponyms for the same location. Synonyms may arise for a variety of reasons – both historical and linguistic – but the problem is particularly aggravated by the tendency of authors throughout time to use contemporary names for ancient places (e.g. ‘London’ for ‘Londinium’) in their studies, commentaries and translations. Since Pleiades does not readily contain alternative toponyms, and certainly not all alternatives, many place-references in our corpus of books may fail to be ‘tagged’. In order to mitigate this problem we have sought to align Pleiades with the contemporary digital gazetteer GeoNames, whose database is available for download free of charge under a Creative Commons attribution license (http://www.geonames.org/). GeoNames provides multiple contemporary toponyms, as well as, frequently, the ancient name itself. Although the process is somewhat laborious, and by no means comprehensive, we have found that the main Pleiades gazetteer can be expanded in this manner by approximately 30% (from approx. 31,000 to 43,000 toponyms). We refer to the expanded gazetteer as Pleiades+ (‘Pleiades Plus’).

A technical summary in 11 stages
The following steps were undertaken in order to produce Pleiades+:
1. The latest data dump is retrieved from Pleiades. This includes a table of Pleiades locations, a table of the Barrington Atlas IDs and a table of the Barrington Atlas maps.
2. The latest data dump is retrieved from GeoNames. This includes all data from countries covered by the Pleiades gazetteer, filtered to exclude irrelevant feature types (e.g. Airports, etc.).
3. Alternative toponyms are extracted from the Pleiades and GeoNames gazetteers in order to produce ‘Toponym tables’ which map normalized toponyms to their identifiers (this equates to a ‘many-to-many’ mapping).
4. A table of Barrington Atlas grid squares and their bounding coordinates is calculated from a table of the Barrington Atlas maps.
5. The Barrington Atlas Ids table is expanded to extract the grid square(s) associated with each Pleiades identifier and joined to the Grid Square table in order to access its bounding coordinates.
6. The Pleiades Toponym table is joined to the Barrington Atlas Ids table (and thus in turn to the Grid Square table) in order to ascertain bounding coordinates for each toponym, where known.
7. The Geonames Toponym table is rejoined to the Geonames gazetteer in order to ascertain coordinates for each toponym.
8. The Pleiades Toponym table is aligned with the Geonames Toponym table in cases where the normalized toponyms are the same and the Geonames coordinates fall within the bounding coordinates of the Pleiades toponym. This has the result of matching Pleiades identifiers to Geonames identifiers.
9. The resulting Pleiades-GeoNames matches are then rejoined to the GeoNames Toponym table in order to elicit all the other Geonames toponyms associated with the GeoNames ID.
10. A new CSV file is generated containing i) the original Pleiades Toponym table expanded to include a centroid for each grid square as a proxy location; and ii) the additional toponyms derived from GeoNames.
11. The final list of results contains the following fields:
a. The Pleiades identifier (mandatory)
b. The normalized toponym (mandatory)
c. The unnormalized toponym (mandatory)
d. The source [Pleiades | GeoNames] (mandatory)
e. The GeoNames ID (mandatory if GeoNames is the source)
f. The Pleiades centroid x, y (where known)
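The crucial matching stage (step 8 above) can be sketched as follows: a Pleiades toponym and a GeoNames toponym are aligned when their normalized names are identical and the GeoNames point falls inside the bounding box of the Barrington grid square associated with the Pleiades toponym. The data structures and values here are illustrative, not the actual tables.

```python
# Sketch of step 8: align Pleiades and GeoNames toponyms by normalized
# name plus a coordinate containment check against the grid-square bbox.

def within(bbox, lat, lon):
    """True if (lat, lon) lies inside bbox = (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def match(pleiades_toponyms, geonames_toponyms):
    """Yield (pleiades_id, geonames_id) pairs for aligned toponyms."""
    for p_id, p_name, bbox in pleiades_toponyms:
        for g_id, g_name, lat, lon in geonames_toponyms:
            if p_name == g_name and within(bbox, lat, lon):
                yield p_id, g_id

# Toy data: one Pleiades toponym with its grid-square bounding box, and two
# GeoNames candidates -- one inside the box, one far away.
pleiades = [("p1", "londinium", (51.0, -1.0, 52.0, 0.5))]
geonames = [("g1", "londinium", 51.5, -0.1),
            ("g2", "londinium", 40.0, -75.0)]
matches = list(match(pleiades, geonames))
```

The containment check is what keeps same-named places on the wrong continent (the second candidate here) out of the alignment; the matched GeoNames IDs are then rejoined to the GeoNames toponym table to harvest the extra alternative names (step 9).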


GAP work-in-progress report 1 (Oct 10-Jan 11)

GAP has been running for just over three months, so we thought that now was an appropriate time to pause and reflect on where we are in our attempt to extend the discovery and querying of ancient places from the HESTIA ‘gold-standard’ Herodotus (with all places verified by hand) to the unstructured 1.2 million books that comprise the Google Books corpus (where places won’t be identified in the XML mark-up). In the series of posts that follow, each member of the team attempts to summarize the work that they’ve done, the problems they’ve encountered, and the next steps they intend to take.

We believe that there are at least two good reasons for making this activity transparent and for documenting our procedure. First, it’s a useful exercise for us to go through as a team: taking the time at the end of our first quarter together to consider the point that we’ve reached is helpful in bringing back into focus the goals of the project, while alerting us to the direction that the follow-up work needs to take. Second, it gives us the opportunity to think about and process particular issues that may have arisen. One benefit of doing this may be to facilitate overcoming those problems; here, the experience of users will be of particular value, so we welcome any feedback you may have to give! But of even greater importance is the documentation of the problems themselves.

To give a short example: only last week I gave a presentation of HESTIA at a one-day workshop on GIS for historians at the Institute of Historical Research in London (http://www.history.ac.uk/node/2278/), at the invitation of Ian Gregory (http://www.lancs.ac.uk/staff/gregoryi/). One of the key points to emerge from the subsequent discussion (raised specifically by one of the researchers on a recently started project: http://www.ucl.ac.uk/sargon/) was the crucial importance for digital humanists of recording the issues that they encounter in the process of delivering on their outcomes. Even if they themselves are unable to find an appropriate solution for any given problem, by making a record of the issues raised they might help other researchers (not necessarily from the same Humanities subject area) to avoid making the same mistakes, or at least to learn from the previous responses attempted. In this sense, the practice and recognition of digital humanities research may come to resemble work done in the sciences, in which, even if the experiment ultimately fails, the process through which one goes has a value in and of itself.

With the greater good in mind, then, let me allow Eric, Kate and Leif to share with you their story so far of GAP in their own words.


Visualizing Associations of Place Names in Texts

In my last post, I discussed some early analysis, based on the HESTIA project, of place-names in the Histories of Herodotus. After lots of grief with memory and other server-configuration issues, I finally managed to deploy a preliminary interface and visualization for these analyses. A couple of things to note:

The redder the place-marks and lines, the stronger the association between place terms. The GoogleMaps rendering loads pretty big/complicated KML files in the background; it’s sometimes not too smooth, and it may take a page refresh or some playing with the zoom controls before you can see anything.

  1. Here’s an example for Byzantium (direct link to a large map)
  2. Here’s another example for a more distant place term (from Herodotus’ perspective), Palestine (direct link to a large map)

Notice, in the case of Palestine, the strongest associations with other place terms go to Egypt and Syria. These are two other distant regions from Herodotus’ perspective.

The ‘Strength of Association’ metric we’re playing with is based on the following method. Each token in the Histories has an index number, counting from 1 at the beginning of the text to 241,950 at the end. We use these index numbers to calculate distances between toponyms.

Right now, we’re using an inverse-square relationship to calculate strength of association. We chose this for no particular reason, except that it seems to weigh very close co-occurrences of toponyms much more heavily than distant ones (gravity, light, and other physical phenomena relate nicely to inverse-square laws). The closer two toponyms occur in the text, the stronger the association.
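A minimal sketch of this metric, under the description above: each toponym occurrence carries a token index, and each pair of occurrences of two different toponyms contributes 1/distance² to the association between those place terms. The function and the toy occurrence data are illustrative, not the actual implementation.

```python
# Sketch: accumulate inverse-square association scores between place terms
# from the token indices of their occurrences in the text.
from collections import defaultdict

def association_strengths(occurrences):
    """occurrences: list of (token_index, toponym), sorted by token index.
    Returns {(name_a, name_b): score} with names in sorted order."""
    scores = defaultdict(float)
    for i, (idx_a, name_a) in enumerate(occurrences):
        for idx_b, name_b in occurrences[i + 1:]:
            if name_a == name_b:
                continue  # only associate *different* place terms
            pair = tuple(sorted((name_a, name_b)))
            scores[pair] += 1.0 / (idx_b - idx_a) ** 2
    return dict(scores)

# Toy occurrences: "Egypt" sits 2 tokens from "Palestine", "Byzantium" 10 tokens away.
occ = [(100, "Palestine"), (102, "Egypt"), (110, "Byzantium")]
scores = association_strengths(occ)
```

With these toy indices the Palestine–Egypt pair (distance 2) scores 0.25, swamping the Palestine–Byzantium pair (distance 10) at 0.01, which shows how sharply the inverse square privileges close co-occurrence.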

Next up: Referencing Pleiades place entities!

Please note! These results are preliminary and exploratory. We’re playing with methods and make no claims that this approach has any particular analytic value.
