Digital Humanities 2011

The GAP team recently returned from the Digital Humanities 2011 conference at Stanford, where we gave a paper. Not only was it a highly enjoyable conference, it also gave all four of us our first opportunity to sit in the same room at the same time! Elton and Leif made it out a week early to visit Eric in beautiful Berkeley and get some last-minute coding done, helped by some earlier text crunching by Kate. This really helped us get to a point where GAP is starting to deliver on its promise: the ability to visualise the spatiality of texts, and to discover texts associated with places. On top of that we also had a stimulating discussion with Google's John Orwant, saw some great papers and ate like kings 🙂


A Few Examples for DH 2011

I’m posting a few links to some examples of place identifications performed by the GAP team. Please note these results are preliminary.

Here’s an early “Alpha” version of displaying identified places on a map. It has bugs, and certain books will probably crash your browser, so we’re showing Tacitus only for now until we work out ways to progressively load identified places:

  • Narrative Map of Tacitus (Note: to be updated shortly)

Here are a few examples of our “Placebook” interface:


Google Books, Identifiers, and Referencing Tokens

The GAP project explores the workflows needed to identify ancient places in unstructured texts (books) so that researchers can reference those places in Linked Data applications. Most of the important challenges we have encountered concern identifiers: of texts, of fragments of texts (including individual “tokens”), and of place entities. Below we describe some of these issues.

Why Token Identification Matters

Tokens (usually individual words) are the fundamental units in text analysis and entity identification. Identifying tokens clearly and consistently is a basic requirement for making text analysis and entity identification an integral part of scholarly practice. The adage of “garbage in, garbage out” applies to textual analysis, and tokenization is an important first step in many later analytic approaches to texts. The reliability and quality of tokenization therefore affect all downstream analysis.

Text mining algorithms are far from perfect. They often require special “tuning” to suit the book or corpus under study, and their results can and should be questioned. Moreover, researchers may want to apply different text analysis algorithms to the same texts, perhaps using certain approaches to identify historical events or persons, and others to identify historical places. Researchers will then need to combine the results of these different analyses in order to compare and evaluate the outcomes of different approaches to text mining.

Because of these needs, individual tokens need clear, consistent, and persistent identifiers. Such identifiers could be used to compare and contrast entity identification results. For example, the token “Paris” may be identified as a person (a character from the Iliad) by one algorithm, while a different algorithm may identify it as a geographic place. Persistent identifiers for tokens make such conflicting results easy to detect.
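
To make the idea concrete, here is a minimal sketch (in Python) of how stable token identifiers let the outputs of two algorithms be lined up. The book ID, page/token numbering and labels are invented for illustration and are not GAP's actual data model:

```python
# Two hypothetical algorithm outputs, keyed by a stable token identifier:
# (book ID, page number, token index). IDs and labels are illustrative only.
algo_a = {("abc123", 42, 7): "PERSON"}   # "Paris" read as a character
algo_b = {("abc123", 42, 7): "PLACE"}    # "Paris" read as a geographic place

# Conflicting identifications fall out of a simple comparison:
conflicts = {tok: (algo_a[tok], algo_b[tok])
             for tok in algo_a.keys() & algo_b.keys()
             if algo_a[tok] != algo_b[tok]}
print(conflicts)   # {('abc123', 42, 7): ('PERSON', 'PLACE')}
```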

Identifying Tokens in Books

Google Books offers fairly stable URIs to individual books and to pages within them. These URIs could be made more trustworthy if they did not include query parameters, but they are suitable for referencing entities at the granularity of a single page of a given book. If one looks at the HTML markup of the Google Books data, one finds individual tokens (words) bounded by <span> elements. These <span> elements have title attributes that describe bounding boxes for the tokens. Presumably these bounding boxes note the position of the word or token on the scanned image of a page, and Google uses them to highlight terms relevant to a user’s search request.
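
As an illustration, the following sketch extracts tokens and bounding boxes from markup of this general shape. The HTML fragment and the format of the title attribute are assumptions based on the description above, not real Google Books output:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical markup: the exact title-attribute format Google uses may differ.
html = ('<span title="bbox 120 340 180 362">Paris</span> '
        '<span title="bbox 190 340 230 362">fled</span>')

soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all("span", title=True):   # every <span> with a title attribute
    token = span.get_text()
    left, top, right, bottom = (int(v) for v in span["title"].split()[1:])
    print(token, (left, top, right, bottom))
```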

The bounding box data is the only identification of specific tokens in the Google Books HTML markup. Unfortunately for our purposes, Google uses the title attribute rather than the id attribute to express bounding boxes, so tokens cannot be identified and referenced with a standard URI plus fragment identifier (the part of a URI beginning with “#”).

We’ve asked the Google Books team for help on this issue, and we’re learning that Google may have some web services that could be used to reference specific tokens by their bounding box coordinates. We should learn more about these shortly. For the time being, however, we need an alternative approach to referencing specific tokens. One possibility is for a successor to the GAP project to create its own set of Web resources in which books, pages, and individual tokens all carry persistent URIs.
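
A hypothetical sketch of what such a URI layout might look like; the base address and path segments are invented, as no such service exists yet:

```python
BASE = "http://example.org/gap"   # hypothetical base address

def book_uri(book_id):
    return f"{BASE}/books/{book_id}"

def page_uri(book_id, page):
    return f"{book_uri(book_id)}/pages/{page}"

def token_uri(book_id, page, index):
    # The token's position in the page's token sequence serves as its local name.
    return f"{page_uri(book_id, page)}/tokens/{index}"

print(token_uri("abc123", 42, 7))
# -> http://example.org/gap/books/abc123/pages/42/tokens/7
```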


Matching lexicons to gazetteers

I’ve been aware for a while that there was a mismatch between the resources the Geoparser uses for geotagging (finding toponyms in text) and for georesolution (determining their lat/long position), and I’ve now got around to dealing with it. We’ve been trying to use the Geoparser without too much tweaking and reprogramming, but I clearly needed to make the lexicons used for geotagging tie up with the Pleiades+ gazetteer used for georesolution.

For place names this is pretty straightforward, as the new lexicon is largely derived directly from Pleiades+. I also needed a lexicon of ancient personal names, as one of the main reasons for poor precision and recall scores in the geotagging seemed to be confusion over personal names: there are several places (in the modern world) called Priam, for example.

Dropping the modern place name lexicons altogether improves performance, and adding lists of ancient personal names has helped still further. The overall result is that, although there’s much tinkering we could still do, the geotagging is now producing pretty good results that are fit for our purposes. Compared against the gold standard of our hand-annotated Hestia data, the performance scores (using standard NLP precision, recall and F1 measures) are:

precision (percentage of our tags that are correct): 77.74%
recall (percentage of target we find): 95.58%
F1 (harmonic mean): 85.74%
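
For reference, these are the standard definitions behind those figures, with our scores substituted in:

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.7774 \times 0.9558}{0.7774 + 0.9558} \approx 0.8574
\]
```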

There’s a simple display of Herodotus Book 1 text at http://synapse.inf.ed.ac.uk/~kate/gap/normname2.display.html. That display only highlights toponyms in the text, but one of the other things we’re playing around with is identifying personal names and temporal expressions. It may be that we can do interesting things with those in GAP, if we can identify them reliably.

The next thing I’m planning to do is to get back to processing actual Google Book texts. I’d interrupted myself on that in order to fix the problems with the geotagging performance.


Some progress on geo-resolution

A quick update on what I’ve been up to recently, and plans for the next couple of weeks:

  • I’ve experimented with combining Pleiades+ with GeoNames (i.e. looking up toponyms in both) but, as expected, this floods the results with too many modern places, mostly in the Americas. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonames.html (zoom out to see the whole world).
  • I tried using the “bounding box” feature of the geoparser to indicate a strong preference for locations in Europe and North Africa, which removes many of the American locations. But if there is only one candidate it will be chosen, wherever it is, so this is still not satisfactory for ancient texts. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeonamesbounded.html.
  • What does seem to work well is Leif’s idea of using geonames as a source of alternative names to try against Pleiades+. For example, we now find a location for “Egypt” because “Aegyptus” is one of the geonames alternatives and is in Pleiades+. See http://synapse.inf.ed.ac.uk/~kate/gap/plplusgeoaltsdisplay.html.
  • The next thing is to try to improve the positioning we get from Pleiades+. So far we’re getting quite a few returns with no lat/lon position (which the geoparser plots on the equator, as zero/zero). Leif has had a look at the data and explained what’s going on. It looks as if we can fill in many of the blanks by looking them up in GeoNames; that’s now on my to-do list.
  • Once we believe we have the system working well, we will need to devise a method for formal evaluation against the Hestia gold standard data, to check whether we are finding the correct locations for toponyms where there are multiple candidates (such as for Salamis, which is the name of more than one place). We’ve discussed ways that this might be approached.
  • In parallel to this work on the georesolution step, I’m working on improving the geotagging, by adding gazetteers of personal names (like Priam, Medea etc) to the process. At the moment “Priam” for example is recognised as a place name because there is a Priam in the USA, and it’s not listed as a common personal name in the references the geoparser uses.
  • I’ve already written scripts for the formal evaluation of the geotagging step against the Hestia gold standard, i.e. to check whether we are identifying the same place names. This is more straightforward than the georesolution evaluation because we can tokenise the marked-up data and the plain version that goes through the geoparser in the same way, and then compare the two data sets token by token to produce standard Precision/Recall/F-score measures (a sketch of this comparison follows this list). As noted in an earlier post, the recall is good but precision is in need of attention.
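
Here is a minimal sketch of that token-by-token comparison, assuming both streams have been tokenised identically so that they align one-to-one; the data format is simplified for illustration:

```python
def evaluate(gold_tokens, predicted_tokens):
    """Compare two aligned streams of (token, is_place) pairs.

    Assumes identical tokenisation, so the streams align one-to-one.
    """
    tp = fp = fn = 0
    for (_, gold_is_place), (_, pred_is_place) in zip(gold_tokens, predicted_tokens):
        if gold_is_place and pred_is_place:
            tp += 1
        elif pred_is_place:          # tagged, but not a place in the gold standard
            fp += 1
        elif gold_is_place:          # a gold-standard place the tagger missed
            fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. gold = [("Croesus", False), ("of", False), ("Lydia", True), ...]
```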

Visualising some sample results using Pleiades+

Now that the new extended version of the Pleiades name-set based on GeoNames (aka Pleiades+) is available, I’ve altered the Geoparser to use it as the gazetteer in the georesolving step, which works out the geographical location of places mentioned in a text. I’ve posted some sample results, for Book 1 of the HESTIA Herodotus, at http://synapse.inf.ed.ac.uk/~kate/gap/plplusdisplay.html. This shows the place-names found in the geotagging step and the location ranked first by the georesolver, where there were one or more matches in Pleiades+.

As this sample shows, there are some erroneous “places” (like “Priam”) and some valid places for which no location was found (like “Egypt”). The first issue is a matter of improving precision in the geotagging step, as discussed in my last post. The second arises because Pleiades+ does not include modern place-names: Pleiades obviously has Egypt in its dataset, but it resides under the label “Aegyptus”. (Pleiades also prefers the Latinised forms of place-names to the Greek.)

We are currently trying to work out the best way of dealing with the missing place-names problem. One option is to fall back on GeoNames whenever no match can be found in Pleiades+, but this brings with it the danger of swamping the user with contemporary place-names of the kind we wouldn’t expect to find in ancient texts. A neater idea, suggested by Leif, is to look up missing places like Egypt in GeoNames and then try to match the alternative names listed there against Pleiades+. As it happens this would work fine in the case of Egypt – “Aegyptus” is indeed one of the alternative names – except that it is only listed as an alternative for “Arab Republic of Egypt”, not for “Egypt” itself, so the lookup would need to treat alternative names as entry points too. Over the next week or so I’m going to investigate whether this option is feasible within the geoparser’s architecture. (We may find there are simply too many alternative names to handle: in general every place name has multiple candidate entries in GeoNames, each of which in turn has multiple alternative forms.)
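
To show how the alternative-names fallback would work, here is a minimal sketch assuming two in-memory lookup tables; the entries and the Pleiades ID are illustrative, and the real geoparser queries its gazetteers quite differently:

```python
# Normalized toponym -> Pleiades ID (the ID here is illustrative).
pleiades_plus = {"aegyptus": "pleiades:766"}

# GeoNames alternate names, keyed by *any* of an entry's names, so that a
# lookup on "egypt" still reaches the "Arab Republic of Egypt" entry:
geonames_alternates = {
    "egypt": ["arab republic of egypt", "aegyptus", "misr"],
    "arab republic of egypt": ["egypt", "aegyptus", "misr"],
}

def resolve(toponym):
    key = toponym.lower()
    if key in pleiades_plus:                      # direct hit in Pleiades+
        return pleiades_plus[key]
    for alt in geonames_alternates.get(key, []):  # fall back on GeoNames alternatives
        if alt in pleiades_plus:
            return pleiades_plus[alt]
    return None

print(resolve("Egypt"))   # -> 'pleiades:766', via the alternative name 'Aegyptus'
```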

Another intriguing idea has been proposed by Prof. Bruce Robertson of Mount Allison University (Canada), who acts as a Technical Observer for the Pleiades Project: he wonders whether we could make effective use of the Latin Wikipedia, especially given Pleiades’ penchant for Latin names – and, indeed, in our case this resource would find Egypt under the string “Aegyptus”. Additionally, at the top of each relevant page a Lat/Long is given, which at the very least could be used as a sanity check!

In the end, however, it may turn out that we need an enhanced version of Pleiades+ – a Pleiades++ as it were – which would contain the kinds of names that we expect to occur.


A DIALOG between projects: Bridging the GAP to ancient world data

HESTIA has started to use the latest digital technology to interrogate the geographical concepts mentioned in an ancient historical narrative; GAP builds on this research by pioneering the means to discover ancient places not only in a single text like Herodotus’ Histories but across the entire Google Books corpus; DIALOG goes one step further still by starting to bring ancient world research together, so that different kinds of data related to any given ancient location can be discovered, queried and visualised.

DIALOG (Document and Integrate Ancient Linked Open Geodata) is being funded by JISC (strand 15/10, Infrastructure for Education and Research: Geospatial) and will run from 1 February to 31 October 2011. Employing Linked Open Data (LOD) principles to connect textual, visual and tabular documents that reference places in Ancient World research, it has three primary aims:
i) To define a Core Ontology for Place References (COPR)
ii) To document the process of assimilating place references and publish it as Resource Description Framework (RDF) data
iii) To develop neo-geographic Web services and tools that make the published RDF easily consumable by learners, educators, researchers and the public.

We believe that, by using LOD principles to connect geo-situated textual, visual and tabular documents (hence LOG: Linked Open Geodata), DIALOG will dramatically empower learners, teachers and researchers in seeking to find and use geospatial data and services.

Led by HESTIA and GAP’s Elton Barker (Classical Studies, The Open University) and Leif Isaksen (Archaeological Computing Research Group, Southampton), in collaboration with the JISC-funded project LUCERO (The Open University), DIALOG brings together an experienced, international and interdisciplinary consortium of pre-established teams that use geospatial information technologies for Ancient World research. They are (with datasets in parentheses):

  • Perseus, Tufts (XML-encoded free texts)
  • GAP (narrative free texts)
  • Supporting Productive Queries for Research, KCL, London (fragmentary free texts)
  • Arachne, Cologne (database records of material finds)
  • Digital Memory Engineering, Austrian Institute of Technology (rasterized maps)

DIALOG partners will exchange data, practices and experience with each other as they align their place referencing to the Uniform Resource Identifiers (URIs) produced by the Pleiades gazetteer of ancient places. In this way, when one project points to a particular ancient location, it will be possible for the user to find out what other datasets also refer to that place, and bring that information to bear on their analysis of it.
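
By way of illustration, here is how a partner record might be linked to a Pleiades URI in RDF, sketched in Python with rdflib. The record URI is invented, and dcterms:spatial merely stands in for whatever predicate the (as yet undefined) COPR ontology settles on:

```python
from rdflib import Graph, Namespace, URIRef  # pip install rdflib

DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
# Hypothetical document URI from a partner dataset:
record = URIRef("http://example.org/arachne/record/12345")
athens = URIRef("http://pleiades.stoa.org/places/579885")  # Pleiades URI for Athenae

# One triple says this record references that ancient place; any dataset
# using the same Pleiades URI becomes discoverable through it.
g.add((record, DCTERMS.spatial, athens))
print(g.serialize(format="turtle"))
```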

We are immensely grateful to all of them (and to others) for their support in putting this successful proposal together, and we look forward to working with them!

More anon


Geoparsing the HESTIA text of Herodotus

I come to the GAP project from working on the Edinburgh Geoparser (see e.g. www.inf.ed.ac.uk/publications/online/1360.pdf), a tool that comprises two main components: 1. the geotagger, which finds placename mentions in the text; and 2. the georesolver, which determines the geographical position of those places, where possible.

I’m responsible for processing the GAP texts with the Edinburgh Geoparser, starting with the HESTIA text of Herodotus. There are two initial tasks to confront:

First, I’m evaluating how well geotagging is likely to work on GAP texts by using the HESTIA text of Herodotus as a test case: my assumption (which may be false!) is that the Herodotus text is reasonably representative of what we’ll be dealing with, in terms of the places and peoples mentioned. It allows us to do a formal evaluation because it has been marked up in TEI and verified by hand. I’ve therefore written an evaluation routine that checks the recall, precision and F-score (the harmonic mean of P and R) of the geotagger’s output against HESTIA’s Herodotus gold standard. (I strip the markup off the hdt_eng-p5-2.xml file, run it through the geotagger, then compare the output against the original.) So far the issue I’m hitting is that, whilst recall is good (81.47%), precision is poor (39.29%). From a brief study of the results, this seems to be largely due to the tagger misclassifying personal names as placenames; to counter the problem I plan to build a gazetteer list of personal names extracted from the Herodotus text and amend the parser to use it. We’ll see whether that helps to achieve a more satisfactory precision rate.
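
For the curious, here is a simplified sketch of that first step: stripping the TEI markup to recover plain text plus the gold-standard place names. It assumes the file uses standard TEI P5 element names, and the real evaluation scripts do considerably more bookkeeping:

```python
from lxml import etree  # pip install lxml

TEI = "{http://www.tei-c.org/ns/1.0}"   # standard TEI P5 namespace
tree = etree.parse("hdt_eng-p5-2.xml")

# Plain text for the geotagger: all character data with markup stripped
# and whitespace collapsed.
plain_text = " ".join(" ".join(tree.getroot().itertext()).split())

# Gold-standard place names, to compare against the geotagger's output.
gold_places = [el.text for el in tree.iter(TEI + "placeName") if el.text]
```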

Second, I’m in the process of loading the latest version of the “Pleiades+” placename gazetteer (provided by Leif – see the previous blog post) into a MySQL database and amending the Geoparser to query it for the georesolving step. The output here will be text (still the HESTIA Herodotus) with place names that can be plotted on a map – only approximately, however, because we’re using Pleiades’ Barrington Atlas grid-square centres in all cases except where GeoNames gives a better fix.
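
A minimal sketch of what the georesolution lookup against that database might look like; the table and column names are guesses based on the Pleiades+ field list, not the actual schema:

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical credentials and schema, for illustration only.
conn = mysql.connector.connect(user="gap", password="secret", database="gazetteers")
cur = conn.cursor()

# Look up the candidate locations for a normalized toponym.
cur.execute(
    "SELECT pleiades_id, centroid_x, centroid_y "
    "FROM pleiades_plus WHERE normalized_toponym = %s",
    ("athenae",),
)
for pleiades_id, x, y in cur.fetchall():
    print(pleiades_id, x, y)

cur.close()
conn.close()
```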

Next plans
i) Once I’m happy with the precision rate, I’ll try altering the Geoparser so that it can take advantage (in step 2, georesolution) of two distinct geographical gazetteers, in this case Pleiades+ and GeoNames.
ii) Move on to other texts!

PS. I’ve highlighted the two-phase nature of the Geoparser because I want to make it clear that the system will only determine the spatial position of places it has already identified in step 1, which doesn’t have access to the gazetteer. This may be a bit of an issue for GAP texts, where we expect a high proportion of ancient place names and personal names that the geotagging component may struggle to recognise. I’ll think about whether there’s anything I can do beyond what I’ve already suggested in connection with the geotagging evaluation.


Pleiades+ : adapting the ancient world gazetteer for GAP

Pleiades (http://pleiades.stoa.org/), a project that is digitizing the Barrington Atlas of the Greek and Roman World (R.J.A. Talbert, ed., Princeton, 2000), is in the process of putting online the most extensive and accurate coverage of ancient locations published thus far. As such it provides the basic gazetteer for GAP’s identification of ancient places in the Google Books corpus. Yet in its present form there are two significant drawbacks to using Pleiades that GAP must first overcome.

The first is that Pleiades does not currently provide specific coordinates, but only the grid square of each location in the Barrington Atlas. This has implications for the plotting of locations and, consequently, for the clustering mechanics on which GAP’s place-resolving algorithm is based. The good news, however, is that Pleiades is working in conjunction with the Digital Atlas of Roman and Medieval Civilization (DARMC) to provide these coordinates (see: http://pleiades.stoa.org/Members/sgillies/news-items/first-coordinates-from-darmc). For the time being, then, we will have to make do with a grid-square centroid, which broadly suffices for the GAP algorithm – but we keep our fingers crossed that very soon we’ll be able to draw upon the specific coordinates of each ancient location mapped in Pleiades.
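
The centroid itself is trivial to compute as the midpoint of the grid square’s bounding coordinates; a quick sketch, using an invented grid square:

```python
def centroid(min_lon, min_lat, max_lon, max_lat):
    """Midpoint of a grid square's bounding coordinates, used as a proxy location."""
    return ((min_lon + max_lon) / 2, (min_lat + max_lat) / 2)

# e.g. a hypothetical Barrington Atlas grid square spanning 23-24E, 37-38N:
print(centroid(23.0, 37.0, 24.0, 38.0))   # -> (23.5, 37.5)
```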

The second issue is that Pleiades has only limited support for multiple toponyms for the same location. Synonyms arise for a variety of historic and linguistic reasons, but the problem is particularly aggravated by the tendency of authors throughout time to use contemporary names for ancient places (e.g. ‘London’ for ‘Londinium’) in their studies, commentaries and translations. Since Pleiades does not systematically contain alternative toponyms, and certainly not all alternatives, many place-references in our corpus of books may fail to be ‘tagged’. To mitigate this problem we have sought to align Pleiades with the contemporary digital gazetteer GeoNames, whose database is available for download free of charge under a Creative Commons Attribution licence (http://www.geonames.org/). GeoNames provides multiple contemporary toponyms and, frequently, the ancient name itself. Although the process is somewhat laborious, and by no means comprehensive, we have found that the main Pleiades gazetteer can be expanded in this manner by approximately 30% (from approx. 31,000 to 43,000 toponyms). We refer to the expanded gazetteer as Pleiades+ (‘Pleiades Plus’).

A technical summary in 11 stages
The following steps were undertaken in order to produce Pleiades+:
1. The latest data dump is retrieved from Pleiades. This includes a table of Pleiades locations, a table of Barrington Atlas IDs and a table of the Barrington Atlas maps.
2. The latest data dump is retrieved from GeoNames. This includes all data from countries covered by the Pleiades gazetteer, filtered to exclude irrelevant feature types (e.g. airports).
3. Alternative toponyms are extracted from the Pleiades and GeoNames gazetteers in order to produce ‘Toponym tables’ which map normalized toponyms to their identifiers (this equates to a ‘many-to-many’ mapping).
4. A table of Barrington Atlas grid squares and their bounding coordinates is calculated from a table of the Barrington Atlas maps.
5. The Barrington Atlas Ids table is expanded to extract the grid square(s) associated with each Pleiades identifier and joined to the Grid Square table in order to access its bounding coordinates.
6. The Pleiades Toponym table is joined to the Barrington Atlas Ids table (and thus in turn to the Grid Square table) in order to ascertain bounding coordinates for each toponym, where known.
7. The Geonames Toponym table is rejoined to the Geonames gazetteer in order to ascertain coordinates for each toponym.
8. The Pleiades Toponym table is aligned with the GeoNames Toponym table in cases where the normalized toponyms are the same and the GeoNames coordinates fall within the bounding coordinates of the Pleiades toponym (see the sketch after this list). This has the effect of matching Pleiades identifiers to GeoNames identifiers.
9. The resulting Pleiades-GeoNames matches are then rejoined to the GeoNames Toponym table in order to elicit all the other Geonames toponyms associated with the GeoNames ID.
10. A new CSV file is generated containing i) the original Pleiades Toponym table expanded to include a centroid for each grid square as a proxy location; and ii) the additional toponyms derived from GeoNames.
11. The final list of results contains the following fields:
a. The Pleiades identifier (mandatory)
b. The normalized toponym (mandatory)
c. The unnormalized toponym (mandatory)
d. The source [Pleiades | GeoNames] (mandatory)
e. The GeoNames ID (mandatory if GeoNames is the source)
f. The Pleiades centroid x, y (where known)
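
To illustrate stages 3 and 8, here is a simplified sketch of the normalization and bounding-box test; the data structures are stand-ins for the real tables:

```python
import unicodedata

def normalize(toponym):
    """Strip accents and case so that e.g. 'Côme' and 'COME' compare equal."""
    decomposed = unicodedata.normalize("NFKD", toponym)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def matches(pleiades_entry, geonames_entry):
    """Accept a pairing only if the normalized names agree and the GeoNames
    point falls inside the Pleiades toponym's grid-square bounding box (stage 8)."""
    min_lon, min_lat, max_lon, max_lat = pleiades_entry["bbox"]
    return (normalize(pleiades_entry["toponym"]) == normalize(geonames_entry["toponym"])
            and min_lon <= geonames_entry["lon"] <= max_lon
            and min_lat <= geonames_entry["lat"] <= max_lat)

print(matches({"toponym": "Athenae", "bbox": (23.0, 37.0, 24.0, 38.0)},
              {"toponym": "athenae", "lon": 23.7, "lat": 37.9}))   # True
```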


GAP work-in-progress report 1 (Oct 10-Jan 11)

GAP has been running for just over three months, so we thought that now was an appropriate time to pause and reflect on where we are in our attempt to extend the discovery and querying of ancient places from the HESTIA ‘gold-standard’ Herodotus (with all places verified by hand) to the 1.2 million unstructured books that comprise the Google Books corpus (where places won’t be identified in XML mark-up). In the series of posts that follow, each member of the team summarizes the work they’ve done, the problems they’ve encountered, and the next steps they intend to take.

We believe that there are at least two good reasons for making this activity transparent and for documenting our procedure. First, it’s a useful exercise for us to go through as a team: taking the time at the end of our first quarter to consider the point we’ve reached helps bring the project’s goals back into focus, while alerting us to the direction the follow-up work needs to take. Second, it gives us the opportunity to think through particular issues that have arisen. One benefit may be that it helps us overcome those problems; here the experience of users will be of particular value, so we welcome any feedback you may have! But of even greater importance is the documentation of the problems themselves.

To give a short example: only last week I gave a presentation on HESTIA at a one-day workshop on GIS for historians at the Institute of Historical Research in London (http://www.history.ac.uk/node/2278/), at the invitation of Ian Gregory (http://www.lancs.ac.uk/staff/gregoryi/). One of the key points to emerge from the subsequent discussion (raised by a researcher on a recently started project: http://www.ucl.ac.uk/sargon/) was how crucial it is for digital humanists to record the issues they encounter in the process of delivering their outcomes: even if they are unable to find an appropriate solution to a given problem, by making a record of the issues raised they may help other researchers (not necessarily from the same Humanities subject area) to avoid the same mistakes, or at least to learn from the responses previously attempted. In this sense, the practice and recognition of digital humanities research may come to resemble work done in the sciences, where, even if an experiment ultimately fails, the process one goes through has a value in and of itself.

With the greater good in mind, then, let me allow Eric, Kate and Leif to share with you their story of GAP so far, in their own words.
