Even more unlocked

Just a quick update on the Unlock Text interface I posted about last time… Eric noticed that the ‘gazref’ attribute didn’t point back to the original source gazetteer – you needed a second query to the server to get that information. That has now been changed, and an extra attribute has been added to the output for locations: ‘source-gazref’.

So, for example, ‘Halicarnassus’ appears in the output (the .lem. files, as explained in the previous post) as:

<ent feat-type="other" gazref="unlock:14186862" id="rb1" in-country="" lat="37.25" long="27.25" pop-size="" source-gazref="plplus:599636" type="location">
  <parts>
    <part ew="w67" sw="w67">Halicarnassus</part>
  </parts>
</ent>

The ‘gazref’ attribute is a local Unlock reference, whereas the ‘source-gazref’ points to Pleiades+ (which was specified as the gazetteer for this run) and corresponds to http://pleiades.stoa.org/places/599636, which is indeed Halicarnassus.

Note, incidentally, that the ‘parts’ element has attributes indicating the start word (‘sw’) and end word (‘ew’) of the string identified as a ‘location’ in the tokenised text. The .lem. output files present the entire input text in tokenised format first, followed by standoff xml pointers for the entities, such as locations, that the geoparser found.
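As a rough sketch of how one might consume this output (this is not part of the official toolchain – the helper name and the minimal wrapper document are mine), Python’s standard library is enough to pull out the location entities and turn a ‘plplus’ source-gazref into a Pleiades URL:

```python
import xml.etree.ElementTree as ET

def pleiades_links(lem_xml):
    """Extract location entities and their Pleiades URLs from .lem.xml output."""
    root = ET.fromstring(lem_xml)
    results = []
    for ent in root.iter("ent"):
        if ent.get("type") != "location":
            continue
        source = ent.get("source-gazref", "")
        # The surface form lives in the <part> children of <parts>.
        name = "".join(part.text or "" for part in ent.iter("part"))
        url = None
        if source.startswith("plplus:"):
            # Pleiades+ references map directly onto Pleiades place URLs.
            url = "http://pleiades.stoa.org/places/" + source.split(":", 1)[1]
        results.append({"name": name, "lat": ent.get("lat"),
                        "long": ent.get("long"), "pleiades": url})
    return results

# A minimal wrapper around the example entity from the post.
sample = """<doc><ent feat-type="other" gazref="unlock:14186862" id="rb1"
 lat="37.25" long="27.25" source-gazref="plplus:599636" type="location">
 <parts><part ew="w67" sw="w67">Halicarnassus</part></parts></ent></doc>"""

print(pleiades_links(sample))
```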

Many thanks to Colin at Edina and Claire at LTG for fixing this. We now have an api that will allow you to specify that you want the Pleiades+ gazetteer, and link placename mentions found in your input text directly back to the corresponding entry at Pleiades.

Posted in Uncategorized | Leave a comment

Unlocking Text

We haven’t posted in a while. Things have been happening… This is just a brief update on some recent work I’ve been doing with the geoparser. If I don’t make it brief it won’t happen at all.

Way back in the summer we had a meetup in New York (roughly the geographical centroid of our team!) and one of the things I was working on then was experimenting with broadening the range of texts we could process. We started out of course with ancient places, and we know the geoparser is happy enough with modern places, so we thought we’d have a go at places in fiction. I had some fun with Around the World in 80 Days and A Tale of Two Cities, for example.

However, one of the issues we’ve been needing to sort out for a while is getting the geoparser properly available on the web so that all this processing of texts doesn’t have to happen on my local machine. That will open the door to the collaborations we want to foster, eg with Project Bamboo, because people will be able to call the geoparser remotely, as an element in their own processing pipelines.

I’m happy to say that this has now become a possibility, thanks to work by Edina and the Language Technology Group at Edinburgh. Edina maintain the Unlock Text api which is the web face of the Edinburgh Geoparser. It has a REST interface that allows you to POST texts using your local URLs and GET the geoparsed results in xml or json format. The documentation page on the site explains how.

There are a couple of important points that the documentation doesn’t yet mention. The first is that, as well as getting a list of placenames found in your text as output, you can also get the full text back with the placenames annotated as standoff xml. This is in the ‘.lem.xml’ and ‘.lem.json’ output files. In fact you get a whole lot more stuff besides placenames, as we also pick out other entities and relations between them – see the LTG page mentioned above for information about the full range of pipeline outputs. But if you just want placenames, look for the <ent> elements with attribute ‘type="location"’. These elements point back to the relevant string in the text and include the latitude and longitude and a reference to the source gazetteer entry.

The second point to note is that you can specify which gazetteer you want to reference your text against. If you don’t specify then it uses all the ones it knows about, currently ‘unlock’, ‘os’, ‘geonames’, ‘naturalearth’ or ‘plplus’. This can produce interesting information but often you’ll want to use a particular one. In the case of classical texts that will be Pleiades+ (‘plplus’) which references the Pleiades gazetteer as we’ve explained in earlier blog posts. The gazetteer can be specified in the json data item, following the name of the input file, as a ‘gazetteer’:’gazname’ pair, like this:

{"src":"http://synapse.inf.ed.ac.uk/~kate/gap/unlockTests/book9plain.txt",
"gazetteer":"plplus"}
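A sketch of submitting such a job from Python follows. The endpoint URL below is a placeholder (use the address in the Unlock Text documentation), and the helper name is mine; only the JSON body shape comes from the example above:

```python
import json
from urllib import request

# Placeholder endpoint – substitute the real URL from the Unlock Text docs.
UNLOCK_ENDPOINT = "http://example.org/unlock-text"

def build_geoparse_request(src_url, gazetteer=None):
    """Build the JSON POST body described above. If no gazetteer is named,
    the service falls back to using every gazetteer it knows about."""
    payload = {"src": src_url}
    if gazetteer:
        payload["gazetteer"] = gazetteer
    data = json.dumps(payload).encode("utf-8")
    return request.Request(UNLOCK_ENDPOINT, data=data,
                           headers={"Content-Type": "application/json"})

req = build_geoparse_request(
    "http://synapse.inf.ed.ac.uk/~kate/gap/unlockTests/book9plain.txt",
    gazetteer="plplus")
# request.urlopen(req) would then submit the job to the service.
print(json.loads(req.data.decode("utf-8")))
```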

As for input formats, the Unlock Text interface will take plain text or html (though it may struggle with very complex html) and it will do its best with pdf.

This is all still a bit of a work in progress, but we are making progress. :)

Merry Christmas!


The Story Continues…

It’s time to announce some great news. Google have been very happy with the first phase of GAP and so have kindly agreed to provide us with additional funding to take us into new territory. This is exciting for a number of reasons. The first is obviously that we, too, have been very happy with GAP’s direction of travel, especially with regard to our partnerships with other projects and teams – particularly Pelagios, Pleiades, the Edinburgh GeoParser and Bamboo. These have been strong collaborations to the benefit of all parties and we look forward to continuing them. The second is that in the next phase there is much less ‘preparatory’ work to do. Last year’s results have given us a solid foundation that allows us to delve straight into new issues. These are likely to include:

  • Greatly expanding the number of texts. There are currently 27 texts (including several duplicates) on GapVis. In the next months we intend to increase that to several hundred.
  • Using the data from multiple books both to help automatically identify different versions of the same text, but also to explore ‘self-correcting’ algorithms that notice mismatches between the place annotation lists and automatically amend them.
  • Improving the crowdsourcing aspects of the project. Automatically generated data can never be perfect. We already have limited functionality for user comments but we’d like to develop a simple but robust system so that users can correct false identifications and even propose new ones.
  • Humanistic research. We intend to take on a few humanistic research questions and see whether our tools help to address them. Do classical poets really have a fixation with Arcadia? How do ancient geographers choose to serialize a two-dimensional geo-space into a one-dimensional text?
  • More collaboration. We are not the only project to be doing geo-annotation (Pelagios is evidence of that). We will continue to compare and connect our results to work being undertaken by others.

Finally, we are delighted to say that our merry band has now formally expanded to five as Kate Byrne and Nick Rabinowitz join us as official Investigators on the project. Of course, they were fundamental to the success of the first phase of GAP and this new role properly recognizes that fact.

We can’t wait to get going on activities that will open up yet more ways of exploring and interacting with the Ancient World, and we look forward to your ever helpful advice, comments and reflections along the way.

Regards, The GAP Team

Elton Barker
Kate Byrne
Leif Isaksen
Eric Kansa
Nick Rabinowitz


Designing a Visual Interface for GAP Texts

In my last two posts, I covered some of the technical approaches we used to develop GapVis, a visual interface for exploring GAP texts. In this post, I’ll discuss some of the interface and visualization choices we made in designing the application.

Our starting point, as with the technical work, was the prototype interface I built for HESTIA. This was a single-screen application with various components to help the reader navigate through the Histories of Herodotus, including a map, a narrative timeline, the text of the page, and a set of navigational controls. As we started to brainstorm all of the new features we might add in adapting this for GAP, we quickly realized that presenting every widget and visualization to the user on one screen would be visually overwhelming. With that in mind, we broke up the application into three “views”, focused respectively on one book, one page, and one place:

  • The Book Summary View, which provides a perspective on the text as a whole;
  • The Reading View, based on the original interface, which focuses on a single point in the narrative and is meant as an enhanced interface for reading the text; and
  • The Place Detail View, which offers more information on how a particular geographic location fits into the text.

In each of these views, we present a set of interface components providing textual detail, visual analysis, and navigation controls, including:

Google Maps: Each view has a map, but each map functions differently according to the focus of the view. The Book Summary has a static map of all places referenced, giving a quick view of the book’s geographic scope; the Reading View only shows places in the narrative vicinity of the current page, fading them out as they fade from the reader’s attention; and the Place Details view shows a network map of related places, based on co-reference (i.e. places that appear on the same page) within the current text. In each of the maps, the dots marking each location are colored according to reference rank, with highly referenced places shown in red while rarer places are a darker purple.

Place Frequency Bars: These are essentially sparklines of textual references, showing where in the narrative each place is mentioned. A good example is the frequency bar for Carthage in Gibbon’s Decline and Fall of the Roman Empire – the city has a lot of references in the beginning of the text, then fades out of the narrative later on. The Book Summary includes frequency bars for all referenced places, allowing you to compare and contrast, while the Place Details view emphasizes the bar for the current location.
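The data behind such a sparkline is just a per-page count of references to one place. A minimal sketch (the function name and the `(page, place)` pair format are assumptions, not the GapVis data model):

```python
from collections import Counter

def frequency_bar(mentions, place, num_pages):
    """Count references to one place per page – the data behind a sparkline.

    `mentions` is a list of (page_number, place_name) pairs; pages are 1-based.
    """
    counts = Counter(page for page, name in mentions if name == place)
    return [counts.get(page, 0) for page in range(1, num_pages + 1)]

mentions = [(1, "Carthage"), (1, "Rome"), (2, "Carthage"), (5, "Rome")]
print(frequency_bar(mentions, "Carthage", 5))  # → [1, 1, 0, 0, 0]
```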

Narrative Timeline: The Reading View includes a narrative timeline (using the SIMILE Timeline widget), showing the current narrative location and the places referenced on nearby pages. The timeline is linked to the map, so that locations are hidden as they move out of the timeline’s visual area.

Navigation Controls: Each view has a set of navigation controls allowing the user to move either within the current view (e.g. switching to a different page or geographic location) or between views, allowing users to change their area of focus. Almost all place or page references are also navigation controls, offering a link to more details. For example, clicking on a point in a place’s frequency bar will jump to that point in the narrative, with that place highlighted.

Screen Transitions: The ease of navigation carries an associated risk of confusing the user by jumping from view to view, which we’ve tried to mitigate with the screen transitions. Each screen slides in logical order from left to right, as do the pages in the Reading View, helping users to situate themselves in the application – a design pattern cribbed from mobile apps (and supported by increasing user familiarity with these interfaces).

There’s a lot more we’d do if we had the time – for example, we have views for a book, a page, or a place, but not for the entire corpus or for a single author, both of which might be interesting. And there are still a number of tweaks and refinements we’ll probably get around to adding eventually, like better interstitial loading screens. It’s definitely a work in progress (we even added “beta” to the header, which is basically the Web 2.0 equivalent of a little animated man with a shovel), but we hope it’s solid enough to test out some of these approaches and get some feedback on what works. Let us know what you think!


Geoparsing Google Books and Open Library texts

As promised a few posts back, I’m now going to bring the geoparsing report up to date with some notes about how we got on with processing large amounts of text, once we’d finished tweaking the Geoparser setup.

Google kindly supplied the raw html and page images for 24 classical texts we requested, and that forms the bulk of the material we’ve processed. We also wanted to experiment with other scanned books that are available, so I downloaded a version of Tacitus’ Annals from the Open Library (http://openlibrary.org/works/OL1108313W/Tacitus). Open Library offer several formats and I chose the plain text version. In each case – html from Google and plain text from Open Library – I needed to transform the input into valid XML and, as is generally the case with OCR-ed text, there were some issues…

Doing OCR on a large scale introduces spurious characters and many of them will upset an XML parser. Sometimes it’s just a question of removing non-printable characters that are outside the range for the character set being used, but sometimes otherwise valid characters will cause problems – such as mismatched quotation marks, brackets and so forth. When the books that were scanned are old – maybe printed a century or more ago – the number of OCR errors can be very high, as the printing isn’t very clear and the paper is a bit crumbly. In tidying up the data there’s a trade-off to be made between processing that is ruthless enough to produce valid XML and the loss of important content that results from too brutal “cleaning”. I have to admit I’m not an expert in this field, though I now know a bit more than I did. :)

A further problem is that in both cases (Google and Open Library) the original material has been scanned directly from the page with minimal or no delimiting of the text from metadata, so headings, page numbers and footnotes are very difficult to distinguish and separate. As much tidying was done as was feasible in the time available, but the results could certainly be improved with cleaner input data. When my pre-processor encountered a page it couldn’t parse, it simply skipped it. The basic bulk cleaning steps (this list is for Google Books input) were:

  1. Remove bytes outside the range for valid characters (ie obvious OCR errors).
  2. Capture and preserve any “soft hyphens” (XML entity &shy;) found, and deal with some, but certainly not all, of the hyphenation problems at line ends.
  3. Translate HTML tags to valid XML ones (replace “<br>” with “<br />” etc).
  4. Capture page numbers and insert into the text as metadata in situ (instead of in a separate file).
  5. Detach trailing punctuation into separate tokens and then use the <span> elements in the input for the Geoparser’s tokenisation process, after inserting required XML markup.
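The first three steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual GAP pre-processor; the function name and the exact character ranges are my assumptions:

```python
import re

def clean_ocr_page(html):
    """Rough sketch of the first bulk cleaning steps for one OCR-ed page."""
    # 1. Drop control bytes that would upset an XML parser (keep tab/newline).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", html)
    # 2. Preserve soft hyphens as the &shy; entity so that line-end
    #    hyphenation can be repaired in a later pass.
    text = text.replace("\u00ad", "&shy;")
    # 3. Close HTML void tags so the result is well-formed XML.
    text = re.sub(r"<br\s*>", "<br />", text)
    text = re.sub(r"<hr\s*>", "<hr />", text)
    return text

page = "Halicar\u00adnassus<br>\x0cnext line"
print(clean_ocr_page(page))  # → Halicar&shy;nassus<br />next line
```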

A batch routine was set up to process each book. In the case of Google Books the input is presented as one file per page; for the Open Library example the whole book was in one file, which was split into the component books of the Annals, as the Geoparser performs better with relatively small files. The input texts vary in size but 30 minutes is a reasonable estimate of the average time needed to process each Google Book. The processing can be done in parallel to a large extent.
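Since each book is independent, the batch can be fanned out with any worker pool. A minimal sketch (the per-book function here is a placeholder standing in for the real ~30-minute pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def geoparse_book(book_id):
    # Placeholder for the real per-book pipeline (clean, tokenise, geoparse);
    # in practice each of these jobs takes on the order of 30 minutes.
    return f"{book_id}: done"

books = ["annals-book1", "annals-book2", "annals-book3"]

# The books don't depend on one another, so they can run side by side.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(geoparse_book, books))
print(results)
```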


Building a Single-Page Application for GAP, Part 2

In my last post, I started setting out some of the work we did to create GapVis, an online interface for reading and visualizing GAP texts. In this post, I’ll go a bit more into the technical details of the application, which uses the Backbone.js framework.

A lot of the process of building a web application like GapVis (at least the way I do it) is about iteratively coming around to a solid architecture. For example, I started off without storing application state in any single place; but I discovered I was rapidly entering a tangled web of cross-referenced function calls in which too many parts of the application had to be aware of each other. Eventually, I arrived at the “global state” pattern, allowing different pieces to be nicely independent of each other by having everything listen to events on a single State model.
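GapVis itself is JavaScript (Backbone.js), but the “global state” pattern is language-neutral. A minimal sketch in Python of a singleton state model that fires change events to its subscribers – class and method names are mine, not GapVis code:

```python
class State:
    """A single shared state model; views subscribe to change events on keys."""

    def __init__(self):
        self._data = {}
        self._listeners = {}

    def on(self, key, callback):
        """Register a callback to run whenever `key` changes."""
        self._listeners.setdefault(key, []).append(callback)

    def set(self, key, value):
        """Update `key`; notify listeners only if the value actually changed."""
        if self._data.get(key) != value:
            self._data[key] = value
            for callback in self._listeners.get(key, []):
                callback(value)

state = State()
log = []
state.on("topview", lambda v: log.append(f"open {v}"))
state.set("topview", "reading")   # fires the listener
state.set("topview", "reading")   # no change, so no event
print(log)  # → ['open reading']
```

Because components only ever talk to the shared state, a view never needs a direct reference to the router or to its sibling views.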

At the same time, I realized that in many cases coordinating the different components would be easier if “parent” components were responsible for their children, so I added some structure to make this simpler. As I went along, I noted the choices I was making, to help me follow a consistent pattern as I added new components:

 Basic architecture:
 - Models are responsible for getting book data from API
 - Singleton state model is responsible for ui state data
 - Views are responsible for:
    initialize:
    - instantiating/fetching their models if necessary
    - instantiating sub-views
    - listening for state changes
    - listening for model changes
    render:
    - adjusting the layout of their container boxes
    - creating their content
    events:
    - listening for ui events, updating state
    ui methods:
    - updating ui on state change
    - updating ui on model change
 - Routers are responsible for:
    - setting state depending on route
    - setting route depending on state
    
 Process of opening a view:
 - URL router or UI event sets state.topview to the view class
 - State fires topview:change
 - AppView receives event, closes other views, calls view.open()
 - view clears previous content if necessary
 - view either renders immediately, or fetches data and renders

This is actually taken straight from the code comment I used to keep track of it. While this kind of documentation is usually used to enforce consistency across multiple programmers, I found it helpful for my own use as well, if only as a way of forcing me to solidify my architectural choices.

The last piece of the process was developing a build system. I use Apache Ant to help automate build tasks that I might otherwise have to do manually again and again. One example is managing the script tags for my Javascript files. My application is broken up into many files so that I can keep the code neatly organized. I keep a list of these scripts in a properties file, and create the script tags in my index.html file automatically. This saves typing, but it also allows the same lengthy list of files to be used in a separate build process for production code. Keeping many separate files with verbose variable names and comments is helpful to me, but it increases the load time of the application significantly. Using Ant, I can automate a “deployment” task that compresses these scripts and sticks them together into a single, fast-loading file for the end user.

The end result is a Javascript-driven app that has several advantages over traditional server-side applications. First, it has comparatively minimal bandwidth requirements, because after the initial load, the only additional data that needs to be loaded is the raw book data from the GAP API. Not requiring a page reload means the application can be very fast and responsive to user input. It also means we can manage animated transitions between view states (e.g. sliding pages to the left or right), which help users maintain a sense of where they are in the application (and, let’s face it, look cool). The application is portable – it’s all front-end code, so moving it to a new server is a matter of minutes. And because building the application this way forces us to define a well-thought-out API for the data, we can offer that API to other applications (our own or someone else’s) with no additional work.

I hope that wasn’t too much technical detail (and if you want more, let me know – I’m happy to answer questions!). The last post will talk more about some of the interface and design choices we made, and how we think this kind of interface enhances and deepens the experience of reading GAP texts. And if you haven’t done so yet, please check out the GapVis application and let us know what you think!


How we tweaked the geoparsing

In my last post I promised some updates, so here’s a short report on the first two points:

  1. Improving the geotagging, so that we can find ancient place-name mentions more accurately in the input texts.
  2. Improving the georesolution, so we can plot more of them on the map, and in approximately the right places!

The geotagging work was outlined in an earlier post so I’ll just recap very briefly and bring us up to date. Geotagging is an NER (named entity recognition) process, and we actually recognise and categorise other classes (such as PERSON) as well as PLACE, to help discriminate between them. In the early stages we were getting good recall (we were correctly spotting over 80% of the Hestia “gold standard” place-name mentions) but very poor precision – only 40% of what we said were place-names actually were. The NER is done partly on linguistic clues (part of speech etc) and partly using lexicons – look-up lists of instances of the classes to be recognised, such as people or places. Most of our poor precision was because personal names, like “Priam”, were being classified as places; there are several places called Priam in the world, and they were in our lexicons.

What was needed was a new set of lexicons, tailored for ancient places and people. The place-names lexicon was easy – just base it on Pleiades+ – and the people lexicon was built from material available at http://www.eisenbraums.com, augmented by extracting marked-up personal names from the Hestia data. Apart from in the formal evaluation over the Hestia gold-standard data, we also added the marked-up Hestia places into the mix, to be used over general input texts. There was a bit of fiddling and tweaking but basically this simple change did the trick, and brought our geotagging precision up to 87%, whilst maintaining acceptable recall at 73%. (Hence an F-score of 79%.)
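For readers unfamiliar with the measure: the F-score quoted is the harmonic mean of precision and recall, and the figures above are consistent with it:

```python
def f_score(precision, recall):
    """Balanced F1 measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.87, 0.73), 2))  # → 0.79
```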

As for the georesolution step, the various ideas we explored were described in a previous post, so I’ll now just explain what worked best and why. We don’t have a way of formally evaluating the georesolution in terms of precision and recall, because we had no marked-up ancient text available with spatial co-ordinates defined. (And in any case, matching one point location against another can be a tricky business – see Evaluation of Georeferencing for discussion.) But from simple visualisations of the data it was clear that if we used only Pleiades+ we missed quite a lot of places, because translators often use modern names (like “Egypt”) whilst Pleiades is all about ancient names (“Aegyptus”). If we tried to solve this by consulting Geonames as well then we were swamped with spurious modern places (like the “Priam” in Minnesota).

The neat solution that Leif came up with is to exploit the list of alternative names provided in Geonames. If we draw a blank on “Egypt” in Pleiades+ we can try it in Geonames and collect a set of alternatives that we then try on Pleiades+. One of them is indeed “Aegyptus”, so we score. There’s a risk that quite separate places may have alternative names in common, but overall this solution seems to work very well and it’s what we’ve used in the final results shown in the visualisations Eric and Nick have produced.
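The fallback logic can be sketched as follows. Both gazetteers are modelled as plain dicts here – an assumption for illustration, not the real Unlock or Geonames APIs, and the entry values are dummies:

```python
def resolve(name, pleiades_plus, geonames_alternatives):
    """Try Pleiades+ first; on a miss, try each Geonames alternative name."""
    if name in pleiades_plus:
        return pleiades_plus[name]
    for alt in geonames_alternatives.get(name, []):
        if alt in pleiades_plus:
            return pleiades_plus[alt]
    return None  # unresolved – the "Phoenicia" case, where both lookups fail

# Toy gazetteers with dummy entry values.
pleiades_plus = {"Aegyptus": "pleiades-entry-for-Aegyptus"}
alternatives = {"Egypt": ["Arab Republic of Egypt", "Aegyptus"]}

print(resolve("Egypt", pleiades_plus, alternatives))      # → pleiades-entry-for-Aegyptus
print(resolve("Phoenicia", pleiades_plus, alternatives))  # → None
```

The noted risk shows up directly in this sketch: if two distinct places shared an alternative name, the first match in the alternatives list would win.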

Of course the automatic geoparsing process will never be completely accurate, and its two-stage nature means that some errors are compounded. If a place-name is missed at the geotagging step there’s no opportunity to attempt georesolution for it. There are some compensations in the other direction however, and the GAP work is only making use of a subset of the Geoparser’s functionality. For example, “Phoenicia” is found by the geotagger but can’t be resolved because it’s not in Pleiades+ and the lookup for alternatives in Geonames also fails. (“Phoenice” is in Pleiades+ but is not listed as an alternative for “Phoenicia” in Geonames.) However, because the Geoparser uses linguistic clues as well as simple lookups, and in fact outputs more information than GAP uses (such as extra entity classes, and relations between them), the Geoparser’s full output includes the information that Tyre (which is successfully located) is in Phoenicia. There is scope in future work for exploiting relationship information of this kind, which in this case would include a clue to where Phoenicia is.

Next time I’m here I’ll finish off on my end of things with a brief report about processing input texts on a large scale.
