The GAP project explores the workflows needed to identify ancient places in unstructured texts (books) so that researchers can reference these places in Linked Data applications. Most of the important challenges we have noted concern identifiers for texts, for fragments of texts (including individual “tokens”), and for place entities. Below we describe some of these issues.
Why Token Identification Matters
Tokens (usually individual words) are fundamental units in text analysis and entity identification. Clear identification of tokens is therefore a basic requirement for making text analysis and entity identification an integral part of scholarly practice. The adage “garbage in, garbage out” applies to textual analysis, and tokenization is an important first step in many analytic approaches to texts. The reliability and quality of the tokenization process affect all downstream analysis.
Text mining algorithms are far from perfect. They often require special “tuning” to suit the book or corpus under study, and their results can and should be questioned. Moreover, researchers may want to apply different text analysis algorithms to the same texts, perhaps using certain approaches to identify historical events or persons and other algorithms to identify historical places. Researchers will need to combine the results of these different analyses in order to compare and evaluate the outcomes of different approaches to text mining.
Because of these needs, individual tokens need clear, consistent, and persistent identifiers. Such identifiers could be used to compare and contrast entity identification results. For example, the token “Paris” may be identified as a person (a character from the Iliad) by one algorithm, while a different algorithm may identify the same token as a geographic place. Because both results would point to the same token identifier, persistent identifiers make such conflicts easy to detect.
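To make the comparison concrete, here is a minimal sketch of how results from two algorithms might be checked against each other once tokens carry shared identifiers. The identifier format (book/page/token index), the tagger names, and the function name are all hypothetical and chosen only for illustration; GAP does not prescribe them.

```python
from collections import defaultdict


def find_conflicts(*result_sets):
    """Return token IDs that were assigned more than one entity type.

    Each result set maps a token identifier to the entity type that one
    algorithm assigned to that token.
    """
    labels = defaultdict(set)
    for results in result_sets:
        for token_id, entity_type in results.items():
            labels[token_id].add(entity_type)
    # Keep only tokens on which the algorithms disagree.
    return {tid: sorted(types) for tid, types in labels.items() if len(types) > 1}


# Hypothetical token identifiers of the form book/page/token-index.
person_tagger = {"iliad/p12/t48": "person", "iliad/p12/t52": "person"}
place_tagger = {"iliad/p12/t48": "place", "iliad/p13/t7": "place"}

conflicts = find_conflicts(person_tagger, place_tagger)
print(conflicts)  # {'iliad/p12/t48': ['person', 'place']}
```

Without a shared identifier, the two taggers' outputs could only be aligned by re-matching strings or character offsets, which breaks as soon as the tokenization differs.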
Identifying Tokens in Books
Google Books offers fairly stable URIs to individual books and pages in individual books. We note that the URIs to Google Books and pages could be made more trustworthy if they did not include query parameters, but they are suitable for referencing entities at the granularity of a single page of a given book. If one looks at the HTML markup of the Google Books data, one finds individual tokens (words) bounded by <span> elements. These <span> elements themselves have title attributes that describe bounding boxes for the tokens. Presumably these bounding boxes note the position of the word or token on the scanned image of a page. Google probably uses these bounding box data to highlight terms relevant to a user’s search request.
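As a rough illustration of how such markup could be read, the sketch below collects tokens and their bounding boxes from <span> elements using Python's standard HTML parser. The exact syntax Google uses inside the title attribute is not documented here, so the "bbox x0 y0 x1 y1" format, the sample markup, and the class name are assumptions made purely for the example.

```python
from html.parser import HTMLParser


class TokenBoxParser(HTMLParser):
    """Collect (token text, bounding box) pairs from <span title="..."> elements.

    Assumes, hypothetically, titles of the form "bbox x0 y0 x1 y1".
    Nested spans are not handled in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._box = None   # bounding box of the span currently open, if any
        self._text = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        parts = (dict(attrs).get("title") or "").split()
        if len(parts) == 5 and parts[0] == "bbox":
            try:
                self._box = tuple(int(p) for p in parts[1:])
            except ValueError:
                self._box = None  # title did not hold four integers
            self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._box is not None:
            self.tokens.append(("".join(self._text).strip(), self._box))
            self._box = None


# Invented sample markup, not copied from Google Books.
sample = (
    '<p><span title="bbox 120 340 198 362">Paris</span> '
    '<span title="bbox 205 340 251 362">and</span> '
    "<span>untitled</span></p>"
)

parser = TokenBoxParser()
parser.feed(sample)
parser.close()
print(parser.tokens)
```

Spans without a recognizable title are skipped, so only tokens that carry coordinates end up in the list.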
The bounding box data are the only identification for specific tokens in the Google Books HTML markup. Unfortunately (for our purposes), Google uses the title attribute and not the id attribute to express bounding boxes. Because a fragment identifier (the part of a URI that follows the “#”) points to an element by its id attribute in HTML, tokens cannot be identified and referenced by their bounding boxes with a standard URI plus fragment identifier.
We’ve asked the Google Books team for help on this issue, and we’re learning that Google may have some web services that could be used to reference specific tokens by their bounding box coordinates. We should learn more about these shortly. For the time being, however, we need an alternative approach to referencing specific tokens. One possibility is that a successor to the GAP project could create its own set of Web resources in which books, pages, and individual tokens all carry persistent URIs.
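One way such a scheme might look is sketched below: each book and page gets its own URI, and each token is addressed by a fragment identifier on its page. The example.org namespace, the path layout, the “t” prefix for token fragments, and the book identifier are all hypothetical; this is an illustration of the idea, not a design the project has adopted.

```python
from urllib.parse import quote, urldefrag

# Hypothetical namespace for a GAP successor project.
BASE = "http://example.org/gap/"


def book_uri(book_id):
    """URI for a whole book; the ID is percent-encoded for safety."""
    return f"{BASE}books/{quote(book_id, safe='')}"


def page_uri(book_id, page):
    """URI for a single page of a book."""
    return f"{book_uri(book_id)}/pages/{page}"


def token_uri(book_id, page, index):
    """URI for one token: the page URI plus a fragment such as '#t48'."""
    return f"{page_uri(book_id, page)}#t{index}"


uri = token_uri("iliad-1898", 12, 48)
print(uri)  # http://example.org/gap/books/iliad-1898/pages/12#t48

# A client can split the token URI back into its page and fragment.
page, fragment = urldefrag(uri)
print(page, fragment)
```

If the page HTML served at each page URI marked every token with a matching id attribute (for example id="t48"), these token URIs would work as ordinary URI plus fragment references.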