As promised a few posts back, I’m now going to bring the geoparsing report up to date with some notes about how we got on with processing large amounts of text, once we’d finished tweaking the Geoparser setup.
Google kindly supplied the raw html and page images for 24 classical texts we requested, and that forms the bulk of the material we’ve processed. We also wanted to experiment with other scanned books that are available, so I downloaded a version of Tacitus’ Annals from the Open Library (http://openlibrary.org/works/OL1108313W/Tacitus). Open Library offer several formats and I chose the plain text version. In each case – html from Google and plain text from Open Library – I needed to transform the input into valid XML and, as is generally the case with OCR-ed text, there were some issues…
Doing OCR on a large scale introduces spurious characters, and many of them will upset an XML parser. Sometimes it’s just a question of removing non-printable characters that are outside the range for the character set being used, but sometimes otherwise valid characters will cause problems – such as mismatched quotation marks, brackets and so forth. When the books that were scanned are old – maybe printed a century or more ago – the number of OCR errors can be very high, as the printing isn’t very clear and the paper is a bit crumbly. In tidying up the data there’s a trade-off to be made between processing that is ruthless enough to produce valid XML and the loss of important content that results from too brutal “cleaning”. I have to admit I’m not an expert in this field, though I now know a bit more than I did. 🙂
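To make the character-stripping idea concrete, here’s a minimal sketch (my own illustration, not the project’s actual pre-processor) that removes anything outside the character range the XML 1.0 specification allows – which covers most of the stray control bytes OCR tends to produce:

```python
import re

# Characters permitted by the XML 1.0 specification: tab, newline,
# carriage return, and the three legal Unicode ranges. Anything else
# (e.g. stray control bytes from OCR) is stripped out.
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that would make an XML parser reject the document."""
    return _XML_ILLEGAL.sub("", text)
```

So, for example, `strip_invalid_xml_chars("Taci\x00tus")` drops the NUL byte and returns plain `"Tacitus"`. Note this deliberately doesn’t touch the trickier cases mentioned above, such as mismatched quotation marks, which are valid characters in themselves.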
A further problem is that in both cases (Google and Open Library) the original material has been scanned directly from the page with minimal or no delimiting of the text from metadata, so headings, page numbers and footnotes are very difficult to distinguish and separate. As much tidying was done as was feasible in the time available, but the results could certainly be improved with cleaner input data. When my pre-processor encountered a page it couldn’t parse, it simply skipped it. The basic bulk cleaning steps (this list is for Google Books input) were:
- Remove bytes outside the range for valid characters (i.e. obvious OCR errors).
- Capture and preserve any “soft hyphens” (the XML character entity &#173;) found, and deal with some, but certainly not all, of the hyphenation problems at line ends.
- Translate HTML tags to valid XML ones (replace “<br>” with “<br />” etc).
- Capture page numbers and insert into the text as metadata in situ (instead of in a separate file).
- Detach trailing punctuation into separate tokens and then use the <span> elements in the input for the Geoparser’s tokenisation process, after inserting required XML markup.
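Two of the steps above – self-closing HTML void tags and detaching trailing punctuation – can be sketched in a few lines of Python. The function names are my own illustrations, not the project’s actual code:

```python
import re

def html_to_xml(html: str) -> str:
    """Self-close the common HTML void elements so an XML parser accepts them,
    e.g. '<br>' becomes '<br />'."""
    return re.sub(r"<(br|hr|img[^>]*)>", r"<\1 />", html)

def detach_trailing_punct(token: str) -> list[str]:
    """Split trailing punctuation off a token into its own token,
    e.g. 'Rome,' -> ['Rome', ','], ready for the tokenisation step."""
    m = re.match(r"^(.*?)([.,;:!?]+)$", token)
    return [m.group(1), m.group(2)] if m else [token]
```

Detaching the punctuation before tokenisation means a place name like “Rome,” is handed to the Geoparser as the bare token “Rome”, which it can then look up.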
A batch routine was set up to process each book. In the case of Google Books the input is presented as one file per page; for the Open Library example the whole book was in one file, which was split into the component books of the Annals, as the Geoparser performs better with relatively small files. The input texts vary in size but 30 minutes is a reasonable estimate of the average time needed to process each Google Book. The processing can be done in parallel to a large extent.
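A per-book batch run of this kind might look something like the sketch below, assuming one HTML file per page as in the Google Books case. The helper names and the `<page>` wrapper element are invented for illustration; the real routine did rather more cleaning before the parse check:

```python
from multiprocessing import Pool
from pathlib import Path
from typing import Optional
from xml.etree import ElementTree

def clean_page(path: Path) -> Optional[str]:
    """Clean one OCR page; return None (i.e. skip the page) if,
    after cleaning, it still won't parse as XML."""
    text = path.read_text(errors="replace")
    # ... apply the bulk-cleaning steps described above ...
    try:
        ElementTree.fromstring(f"<page>{text}</page>")
    except ElementTree.ParseError:
        return None  # unparseable page: skip it, as the pre-processor did
    return text

def clean_book(pages_dir: str) -> list:
    """Process a book's per-page files in parallel, dropping skipped pages."""
    pages = sorted(Path(pages_dir).glob("*.html"))
    with Pool() as pool:  # pages are independent, so this parallelises well
        cleaned = pool.map(clean_page, pages)
    return [p for p in cleaned if p is not None]
```

Because each page (or, for the Open Library text, each component book of the Annals) is independent of the others, a simple process pool is enough to get the “parallel to a large extent” behaviour mentioned above.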