Pleiades (http://pleiades.stoa.org/), a project that is digitizing the Barrington Atlas of the Greek and Roman World (R.J.A. Talbert, ed., Princeton, 2000), is in the process of putting online the most extensive and accurate coverage of ancient locations published thus far. As such it will provide the basic gazetteer for GAP’s identification of ancient places in the Google Books corpus. Yet in its present form Pleiades has two significant drawbacks that GAP must initially overcome.
The first is that Pleiades does not currently provide specific coordinates but only the grid square of each location in the Barrington Atlas. This has implications for the plotting of locations and, consequently, for the clustering mechanics upon which GAP’s place-resolving algorithm is based. The good news, however, is that Pleiades is working in conjunction with the Digital Atlas of Roman and Medieval Civilization (DARMC) to provide these coordinates (see: http://pleiades.stoa.org/Members/sgillies/news-items/first-coordinates-from-darmc). For the time being, then, we will have to make do with a grid square centroid, which broadly suffices for the purposes of the GAP algorithm, but we keep our fingers crossed that very soon we’ll be able to draw upon the specific coordinates of each ancient location mapped in Pleiades.
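As a minimal sketch of the proxy location described above, the centroid of a grid square can be taken as the midpoint of its bounding coordinates. The `BoundingBox` type, the `centroid` function and the sample values are illustrative assumptions, not part of the Pleiades data or of GAP’s actual code.

```python
from typing import NamedTuple, Tuple


class BoundingBox(NamedTuple):
    """Bounding coordinates of one grid square (hypothetical structure)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def centroid(box: BoundingBox) -> Tuple[float, float]:
    """Return the (x, y) midpoint of the box, used as a proxy location."""
    x = (box.min_lon + box.max_lon) / 2.0
    y = (box.min_lat + box.max_lat) / 2.0
    return x, y


# Illustrative grid square only; real bounds would come from the atlas maps.
square = BoundingBox(min_lon=23.0, min_lat=37.0, max_lon=24.0, max_lat=38.0)
print(centroid(square))  # (23.5, 37.5)
```

Every location in the same grid square receives the same proxy point, which is why the centroid only broadly suffices for clustering.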
The second issue is that Pleiades has only limited support for multiple toponyms for the same location. Synonyms arise for both historical and linguistic reasons, but the problem is particularly aggravated by the tendency of authors throughout time to use contemporary names for ancient places (e.g. ‘London’ for ‘Londinium’) in their studies, commentaries and translations. Since Pleiades does not readily contain alternative toponyms, and certainly not all alternatives, many place-references in our corpus of books may fail to be ‘tagged’. In order to mitigate this problem we have sought to align Pleiades with the contemporary digital gazetteer GeoNames, whose database is available for download free of charge under a Creative Commons Attribution license (http://www.geonames.org/). GeoNames provides multiple contemporary toponyms and, frequently, the ancient name itself. Although the process is somewhat laborious, and by no means comprehensive, we have found that the main Pleiades gazetteer can be expanded in this manner by approximately 30% (from approx. 31,000 to 43,000 toponyms). We refer to the expanded gazetteer as Pleiades+ (‘Pleiades Plus’).
A technical summary in 11 stages
The following steps were undertaken in order to produce Pleiades+:
1. The latest data dump is retrieved from Pleiades. This includes a table of Pleiades locations, a table of the Barrington Atlas Ids and a table of the Barrington Atlas Maps.
2. The latest data dump is retrieved from GeoNames. This includes all data from countries covered by the Pleiades gazetteer, filtered to exclude irrelevant feature types (e.g. airports).
3. Alternative toponyms are extracted from the Pleiades and GeoNames gazetteers in order to produce ‘Toponym tables’ which map normalized toponyms to their identifiers (this equates to a ‘many-to-many’ mapping).
4. A table of Barrington Atlas grid squares and their bounding coordinates is calculated from a table of the Barrington Atlas maps.
5. The Barrington Atlas Ids table is expanded to extract the grid square(s) associated with each Pleiades identifier and joined to the Grid Square table in order to access its bounding coordinates.
6. The Pleiades Toponym table is joined to the Barrington Atlas Ids table (and thus in turn to the Grid Square table) in order to ascertain bounding coordinates for each toponym, where known.
7. The GeoNames Toponym table is rejoined to the GeoNames gazetteer in order to ascertain coordinates for each toponym.
8. The Pleiades Toponym table is aligned with the GeoNames Toponym table in cases where the normalized toponyms are the same and the GeoNames coordinates fall within the bounding coordinates of the Pleiades toponym. This has the result of matching Pleiades identifiers to GeoNames identifiers.
9. The resulting Pleiades-GeoNames matches are then rejoined to the GeoNames Toponym table in order to elicit all the other GeoNames toponyms associated with the GeoNames ID.
10. A new CSV file is generated containing i) the original Pleiades Toponym table expanded to include a centroid for each grid square as a proxy location; and ii) the additional toponyms derived from GeoNames.
11. The final list of results contains the following fields:
a. The Pleiades identifier (mandatory)
b. The normalized toponym (mandatory)
c. The unnormalized toponym (mandatory)
d. The source [Pleiades | GeoNames ] (mandatory)
e. The GeoNames Id (mandatory if GeoNames is the source)
f. The Pleiades centroid x, y (where known)
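The alignment at the heart of the steps above (3 and 8 to 11) might be sketched roughly as follows. This is an illustration under stated assumptions rather than the code actually used: the normalization rule, the function names, the in-memory data structures and the identifiers `P1` and `G1` are all hypothetical, and the sample coordinates are illustrative only.

```python
import unicodedata
from collections import defaultdict


def normalize(toponym):
    """Assumed normalization: strip accents, lowercase, keep alphanumerics."""
    decomposed = unicodedata.normalize("NFKD", toponym)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks.lower() if c.isalnum())


def contains(box, lon, lat):
    """True if (lon, lat) falls within (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def build_pleiades_plus(pleiades_names, pleiades_boxes,
                        geonames_names, geonames_coords):
    """pleiades_names / geonames_names: lists of (identifier, toponym);
    pleiades_boxes: id -> bounding box; geonames_coords: id -> (lon, lat)."""
    # Step 3: toponym tables mapping normalized names to identifiers.
    geo_by_norm = defaultdict(set)
    names_by_geo = defaultdict(set)
    for gid, name in geonames_names:
        geo_by_norm[normalize(name)].add(gid)
        names_by_geo[gid].add(name)

    rows, seen, matches, centroids = [], set(), set(), {}
    for pid, name in pleiades_names:
        norm = normalize(name)
        box = pleiades_boxes.get(pid)
        # Step 10 (i): grid square centroid as proxy location, where known.
        x, y = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2) if box else (None, None)
        centroids[pid] = (x, y)
        rows.append({"pleiades_id": pid, "normalized": norm, "toponym": name,
                     "source": "Pleiades", "geonames_id": None, "x": x, "y": y})
        seen.add((pid, norm))
        if box is None:
            continue
        # Step 8: same normalized name and coordinates inside the grid square.
        for gid in geo_by_norm.get(norm, ()):
            if contains(box, *geonames_coords[gid]):
                matches.add((pid, gid))

    # Steps 9 and 10 (ii): add the other toponyms of each matched GeoNames ID.
    for pid, gid in sorted(matches):
        x, y = centroids[pid]
        for name in sorted(names_by_geo[gid]):
            norm = normalize(name)
            if (pid, norm) in seen:
                continue
            seen.add((pid, norm))
            rows.append({"pleiades_id": pid, "normalized": norm, "toponym": name,
                         "source": "GeoNames", "geonames_id": gid, "x": x, "y": y})
    return rows


# Hypothetical sample data.
rows = build_pleiades_plus(
    pleiades_names=[("P1", "Londinium")],
    pleiades_boxes={"P1": (-1.0, 51.0, 0.0, 52.0)},
    geonames_names=[("G1", "London"), ("G1", "Londinium"), ("G1", "Londres")],
    geonames_coords={"G1": (-0.1257, 51.5085)},
)
for row in rows:
    print(row["pleiades_id"], row["toponym"], row["source"], row["geonames_id"])
```

In this sketch ‘Londinium’ is the shared normalized name that links the two records, after which ‘London’ and ‘Londres’ are carried over as additional toponyms for the same Pleiades identifier. The resulting rows carry the six fields listed in step 11 and could be written to a CSV file with Python’s standard `csv.DictWriter`.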