Project Data

The preliminary data generated by this project can be downloaded here:

We’re using the Pelagios system of annotations to express these data. The dataset describes our machine identification of certain tokens in 9 books (7 from Google Books, 1 from the Open Library, and 1 from Perseus) as Pleiades place entities.

The place annotation for a single token (toponym) from a book, together with its identification as a Pleiades URI, looks like this:

<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.openannotation.org/ns/target> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.openannotation.org/ns/Annotation> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.openannotation.org/ns/hasBody> <http://pleiades.stoa.org/places/766> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.openannotation.org/ns/hasTarget>  <http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17#bbox=390,1496,472,1525> .
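As a rough illustration of how triples like these could be read, here is a minimal Python sketch using only the standard library. It assumes every statement has the simple all-IRI form shown above, so it is not a general N-Triples parser. The names `parse_triples` and `group_annotations` are hypothetical helpers made up for this example; they are not part of Pelagios or of our own tooling.

```python
import re
from collections import defaultdict

# Assumption: each statement is "<subject> <predicate> <object> ." with IRIs only.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$")

OA = "http://www.openannotation.org/ns/"

SAMPLE = """
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.openannotation.org/ns/target> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.openannotation.org/ns/Annotation> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.openannotation.org/ns/hasBody> <http://pleiades.stoa.org/places/766> .
<http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17> <http://www.openannotation.org/ns/hasTarget>  <http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17#bbox=390,1496,472,1525> .
"""


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"not a simple IRI-only triple: {line!r}")
        triples.append(match.groups())
    return triples


def group_annotations(triples):
    """Collect the hasBody and hasTarget objects for each subject URI."""
    grouped = defaultdict(lambda: {"bodies": [], "targets": []})
    for subject, predicate, obj in triples:
        if predicate == OA + "hasBody":
            grouped[subject]["bodies"].append(obj)
        elif predicate == OA + "hasTarget":
            grouped[subject]["targets"].append(obj)
    return dict(grouped)


if __name__ == "__main__":
    for subject, parts in group_annotations(parse_triples(SAMPLE)).items():
        print(subject, parts["bodies"], parts["targets"])
```

Note that this sketch groups by subject URI only. If several annotations share one subject, it cannot tell which body belongs to which target.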

For Google Books, we identify specific tokens on specific pages like so:

http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17#bbox=390,1496,472,1525

In the URI above, the parameter “id” is the Google Books identifier for the book, while the parameter “pg” gives the page (“PA17” in this example). The fragment identifier #bbox… is the bounding box locating the identified token on the scanned page of the book. In the HTML version of Google Books used by our project, this bounding box information was expressed in the “title” attribute (with coordinates separated by whitespace, not commas). Since one typically doesn’t use the title attribute for a fragment identifier, our URIs represent a bit of a hack. 😉
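To make the structure of these target URIs concrete, the following Python sketch pulls one apart with the standard library. The function names `parse_target` and `bbox_fragment_from_title` are hypothetical and exist only for this illustration. The bounding box is returned as four integers in the order given, because the meaning of each coordinate is not spelled out here and the sketch does not assume one.

```python
from urllib.parse import urlsplit, parse_qs


def parse_target(uri):
    """Split a Google Books target URI into book id, page and bounding box."""
    parts = urlsplit(uri)
    query = parse_qs(parts.query)
    key, _, value = parts.fragment.partition("=")
    if key != "bbox":
        raise ValueError(f"no bbox fragment in {uri!r}")
    return {
        "book_id": query["id"][0],  # Google Books identifier
        "page": query["pg"][0],     # page parameter, e.g. "PA17"
        "bbox": tuple(int(n) for n in value.split(",")),
    }


def bbox_fragment_from_title(title):
    """Turn whitespace-separated coordinates (as in the HTML "title"
    attribute) into the comma-separated fragment used in the URIs."""
    return "bbox=" + ",".join(title.split())


if __name__ == "__main__":
    uri = ("http://www.google.com/books?id=-C0BAAAAQAAJ&pg=PA17"
           "#bbox=390,1496,472,1525")
    print(parse_target(uri))
    print(bbox_fragment_from_title("390 1496 472 1525"))
```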

For the Open Library Tacitus and for the copy of Herodotus from Perseus, we have not yet worked out ways to create or preserve fragment identifiers through the whole workflow. The fragment identifiers for these books have little to do with the data sources and more to do with our own processing, so they will be hard to resolve to the correct tokens. Nevertheless, we provide them because they offer some useful information about the relative position and order of the tokens in these works.

We release these preliminary data “as is” without warranty in the hope that others may find them interesting and may help us discover bugs or ways to improve our methods. The data are public domain for reasons we outline here.


CC0

To the extent possible under law,

E.T.E. Barker, Leif Isaksen, Eric Kansa, Kate Byrne

has waived all copyright and related or neighboring rights to the
GAP project data.
