Visualizing Associations of Place Names in Texts

In my last post, I discussed some early analysis, based on the HESTIA project, of place-names in the Histories of Herodotus. After lots of grief with memory and other server-configuration issues, I finally managed to deploy a preliminary interface and visualization for these analyses. A couple of things to note:

The redder the place-marks and lines, the stronger the association between place terms. Also, the GoogleMaps rendering loads some pretty big and complicated KML files in the background; it's sometimes not too smooth, and it may take a page refresh or some playing with the zoom controls before you can see anything.

  1. Here’s an example for Byzantium (direct link to a large map)
  2. Here’s another example for a more distant place term (from Herodotus’ perspective), Palestine (direct link to a large map)

Notice, in the case of Palestine, the strongest associations with other place terms go to Egypt and Syria. These are two other distant regions from Herodotus’ perspective.

The ‘Strength of Association’ metric we’re playing with is based on the following method. Each token in the Histories has an index number, counting from 1 at the beginning of the text to 241,950 at the end. We use these index numbers to calculate distances between toponyms.

Right now, we’re using an inverse-square relationship to calculate strength of association. We chose this for no particular reason, except that it nicely weights very close co-occurrences of toponyms much more heavily than more distant ones (gravity, light, and other physical phenomena relate nicely to inverse-square laws). The closer two toponyms occur in the text, the stronger the association.
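For the curious, here is a minimal sketch (in Python) of how such a metric might be computed. The occurrence positions, toponym names, and data layout are hypothetical placeholders for illustration, not our actual pipeline:

```python
from collections import defaultdict
from itertools import product

# Hypothetical occurrence data: each toponym mapped to the token index
# numbers (1-based) at which it appears in the text. Real positions differ.
occurrences = {
    "Byzantium": [10452, 87210, 198334],
    "Bosporus": [10477, 87195],
    "Palestine": [120003],
}

def association_strength(positions_a, positions_b):
    """Sum inverse-square contributions over all pairs of mentions.

    A pair of mentions 1 token apart contributes 1.0; a pair 10 tokens
    apart contributes only 0.01, so near co-occurrences dominate.
    """
    return sum(
        1.0 / (i - j) ** 2
        for i, j in product(positions_a, positions_b)
        if i != j  # guard against a zero distance
    )

# Score every ordered pair of distinct toponyms.
scores = defaultdict(dict)
for a in occurrences:
    for b in occurrences:
        if a != b:
            scores[a][b] = association_strength(occurrences[a], occurrences[b])

print(scores["Byzantium"]["Bosporus"])  # dominated by the closest pair of mentions
```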

Next up: Referencing Pleiades place entities!

Please note! These results are preliminary and exploratory. We’re playing with methods and make no claims that this approach has any particular analytic value.


Connected Histories and Data Mining: New Tools for the Digital Humanities

Bob Shoemaker (Professor of Eighteenth-Century British History, University of Sheffield), one of the primary academics behind the Old Bailey on-line archive (http://www.oldbaileyonline.org/), recently came to address the Open University’s Digital Humanities seminar (22 November 2010). Set up with an AHRC grant and lottery funding, and now running for some eight years, the Old Bailey project has so far involved the manual keying-in of documents (from the Proceedings and Ordinary’s Accounts) and their ‘mark up’ in XML to provide access to key bits of information (crime, date, location, defendant, victim, judges, etc.). But Bob hadn’t come to talk to us about how great the website is, though it has won or been nominated for several awards. Instead, he wanted to talk about their plans to address a couple of limitations they had identified: its lack of links to other resources and its inability to exploit the data that it does hold. Both points have relevance for Digital Humanities and, more specifically, for the GAP project.

1. Linked Data

His initial response to the lack of linkage had been to look to add new datasets that would supplement the Old Bailey records, such as parish archives and criminal records. But this centralizing tendency came at a cost in terms of both time and labour. Instead, his team hit upon a federated model: not a website that would house all the data itself, but a portal that, using a federated search facility, could point users to the original websites specializing in the relevant data. One such portal that Bob has been involved in developing is Connected Histories (http://www.connectedhistories.org), due to be launched in March 2011, which will facilitate the discovery of a wide range of distributed digital resources relating to early modern and nineteenth-century British history.

But, as Bob explained, a major obstacle to research remains even with this more devolved search facility: the kinds of searches that can be done—by name, place, date, keyword, etc—are still limited by being predetermined. What was needed was to find a way of approaching the data without preconceptions and assumptions.

2. Text Mining

The answer, according to Bob, was to make use of data-mining tools that could extract meaningful patterns from masses of data, which could then be analysed. Keyword searches tend to produce too many results, or else the ranking of results can be open to question. (Google, for example, ranks search results according to popularity: fine if you’re a consumer looking for a product, but less well suited to the academic researcher, who, if anything, is looking for the data that is less well known and poorly excavated.) Keyword searches, moreover, have been likened to looking for needles in haystacks: of potentially more use to the scholar would be a search facility that could point to the shape and size of the haystack itself…

This is work-in-progress for all concerned (Bob pointed us to another Old Bailey spin-off, http://www.criminalintent.org, which heralds the beginning of ‘drilling down into the data’). But Bob gave a useful run-down of the tools that could assist the Digital Humanities researcher in making more sensitive enquiries by approaching the data without preconceived questions. Three he mentioned are:

i) Zotero (http://www.zotero.org/): a citation management tool that maps word frequencies to create a cloud (wordle), which highlights prominent themes;

ii) TAPoR (http://portal.tapor.ca/portal/portal): the Text Analysis Portal for Research maps word usage over time, including peaks (or ‘trends’), density, collocations, and types (unique words);

iii) Compression Analysis: this tool measures degrees of similarity between texts based on repetition of word patterns (a ‘more like this’ function), and learns from experience…
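Since compression analysis may be unfamiliar, here is a minimal sketch of one standard measure from that family, the normalized compression distance (NCD); this is my own illustration of the general idea, not necessarily the tool Bob described:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size of the zlib-compressed byte string at maximum compression."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 = similar, near 1 = unrelated.

    Texts that repeat one another's word patterns compress better together
    than they do apart. (Values on short strings are only indicative.)
    """
    cx = compressed_size(x.encode())
    cy = compressed_size(y.encode())
    cxy = compressed_size((x + y).encode())
    return (cxy - min(cx, cy)) / max(cx, cy)

# The similar pair should score lower than the unrelated pair.
print(ncd("the Spartans marched north to the pass",
          "the Spartans marched south to the sea"))
print(ncd("the Spartans marched north to the pass",
          "compression measures textual overlap"))
```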

Bob’s talk has given me much food for thought. The two areas he identified as the next stage for his own project—linked data and text mining—are both highly relevant to the kind of work that we at GAP would like to do with ancient places: linking the ancient places in Herodotus’ Histories to other datasets (whether other ancient textual sources, secondary scholarship, or even artefacts); and finding out more about the citation patterns for ancient locations in Herodotus (and other authors), such as their collocations with other places/nouns or the verbs that connect them. But that’s for another blog…


Filling in some Gaps in GAP

Sometimes there’s nothing like a little face-to-face time to get a project going, even one that’s dealing predominantly with on-line material and involving collaborators from around the globe. So last week I flew to the UK to work first with Elton and Leif in Oxford, and then with Kate in Edinburgh. It was surprising just how much we could achieve in one week, and how much fun we could have in the process!

First off, it’s important to highlight the extent to which GAP derives from an impressive body of prior work; in fact a large part of this initial visit for us all was spent finding out about, understanding and marshaling the resources available to us, which we will be able to use to identify ancient places in the Google Books corpus. For example, I quickly learned that:

  1. HESTIA had already identified some 757 places in the Histories of Herodotus. HESTIA itself built upon the prior digitization efforts of the Perseus Digital Library at Tufts University, from which HESTIA got the digital text of Herodotus in the first place;

  2. The places identified in HESTIA are only the tip of the iceberg: with GAP, we want to try to identify far more places in far more books (ultimately the entire Google Books corpus). To do this, we need a larger database of places. Fortunately, the prior (and ongoing) work of the Pleiades Project provides this larger database.

In many ways, HESTIA represents a microcosm of what we hope to achieve with GAP. Because it has already identified a good number of ancient places in one text, it provides an excellent dataset for experimentation. Therefore we intend to use the HESTIA data to test ideas on what we might be able to do with the potentially thousands of ancient places that may be identified in many thousands of books from the Google corpus.

Identifying places is one thing; what to do with them once identified is another. Since I am a novice to HESTIA and the Histories, I thought it might be interesting to see whether there was any meaningful relationship between where placenames appear in the text and the geospatial connections between those places. It turns out that there may well be a number of significant relationships, but that they reflect many different semantic dimensions going beyond geospatial proximity alone. For example, the following placenames appear in proximity to the placename “Byzantium” in the Histories:

Bosporus (9), Hellespont (6), Miletus (5), Plataea (3), Euxine (3), Thrace (3)

The first two places, the Bosporus and the Hellespont, are in close geographic proximity to Byzantium. However, the following placenames appear in proximity to “Sparta”, a place that is much more prominent than Byzantium in the Histories:

Lacedaemon (50), Athens (45), Delphi (42), Hellas (39), Aegina (37), Asia (21), Susa (21)

In the case of Sparta, places that appear nearby in the text of the Histories tend not to have much to do with geospatial proximity. Instead, the place terms seem to be much more indicative of key political and military relationships having to do with the Greco-Persian Wars. (“Lacedaemon” here is a special case, since it is a synonym for Sparta.)
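To make the counting concrete, here is a minimal sketch of how such proximity counts might be produced: scan a window of tokens around each mention of a target placename and tally the other placenames that fall inside it. The token list, placename set, and window size of 100 are all assumptions for illustration, not the parameters we actually used:

```python
from collections import Counter

WINDOW = 100  # tokens on either side counted as "in proximity" (assumed value)

def nearby_places(tokens, placenames, target):
    """Count placenames occurring within WINDOW tokens of each mention of target."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for neighbour in tokens[lo:hi]:
            if neighbour in placenames and neighbour != target:
                counts[neighbour] += 1
    return counts

# With hypothetical inputs `histories_tokens` and `known_places`, a call like
#   nearby_places(histories_tokens, known_places, "Byzantium").most_common(6)
# would yield pairs such as [("Bosporus", 9), ("Hellespont", 6), ...]
```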

It will be interesting to see how these results from HESTIA and the Histories compare with the much larger Google Books corpus. Here are some issues we want to explore:

Will the proximity (in text) of placenames show any historical significance?

Will we be able to note changes in the geospatial orientation of research in Classics over the centuries? Can we even identify different national traditions of classical scholarship, where, for example, certain regions may be favored for discussion in German literature as opposed, say, to French or English?

But, first things first, we have to identify the ancient places. Looking ahead to scaling up beyond HESTIA and the Histories to other books in the Google Books corpus, we spent a great deal of effort on reconciling HESTIA places with the places currently being compiled by the Pleiades project. The Pleiades gazetteer is largely based on the Barrington Atlas, arguably the key reference work on Greco-Roman geography, and has amassed a database of some 35,000 places. Our first major task has been to relate all of the HESTIA places to those covered by Pleiades, which we have just about done! Not only does this allow us to cite Pleiades (a key Web-based source); it also means that we can match the different toponyms from HESTIA to the variants provided by the Pleiades gazetteer. Along these lines, we have also related some 47,000 additional toponyms (from GeoNames) to Pleiades place entities. In doing all of this, we have built a much richer database of toponyms that we can use when we index the Google Books corpus.
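As an illustration of what this reconciliation buys us, here is a minimal sketch of the variant-to-entity lookup it enables; the identifiers and name groupings below are invented placeholders, not actual Pleiades IDs:

```python
from collections import defaultdict

# (place_id, toponym_variant, source) triples from the reconciliation work.
reconciled = [
    ("pleiades:0001", "Byzantium", "HESTIA"),
    ("pleiades:0001", "Byzantion", "GeoNames"),
    ("pleiades:0002", "Sparta", "HESTIA"),
    ("pleiades:0002", "Lacedaemon", "HESTIA"),
]

# Invert into a variant -> place-entity index for use when indexing books.
index = defaultdict(set)
for place_id, variant, source in reconciled:
    index[variant.lower()].add(place_id)

# Any attested variant now resolves to the same underlying place entity:
print(index["byzantion"])   # {'pleiades:0001'}
print(index["lacedaemon"])  # {'pleiades:0002'}
```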


Talking Digital Infrastructure @ the ESF, Strasbourg

Back at the end of October I attended a two-day workshop put on by the European Science Foundation (yes, Humanities research counts as Science in the European Union). The workshop addressed the issue of research communities and infrastructures in the Humanities as they develop in a digital context. Researchers from all around Europe, involved in various aspects of Digital Humanities, were invited to talk about their experiences in five thematic sessions. I was there, representing both HESTIA and GAP, to share my experience of working in an interdisciplinary group, in session 4 below (for a downloadable pdf of the programme, go to: http://tinyurl.com/37xvg6c):

1. Research communities and adoption of research infrastructures, both traditional (e.g. museums, libraries, archives) and digital, with a particular focus on the extent and nature of their integration
2. Re-purposing and re-use of data in digital form and how this process can engage researchers
3. Text vs. non-text (audio and material) digitization in changing research practice
4. Disciplinary vs. interdisciplinary resources, and the opportunities afforded by digital technologies
5. Integrating extant resources: the role of digital research infrastructures

The ‘wrap-up’, which runs to several pages and which will influence strategy at a Europe-wide level (see: http://tinyurl.com/2wbrtjj), raised three key issues for me.

First, and perhaps most encouragingly, the ESF committee made it clear that they didn’t see it as their job to establish or enforce an infrastructure themselves: instead, there was an acknowledgement that any future, stable infrastructure had to be community-driven and would work better at local levels with user input. The role of pan-European bodies like the ESF lay, rather, in support, by maintaining scholarly standards (essential for the reusability of data), overseeing transparency of methods, ensuring recognition of digitally-based work in publication records and promotion cases, facilitating co-operation between groups/individuals, and helping to establish best practice guidelines. A tangible part of this guidance would be in offering training, so that all academics could develop a working competency in the field, though whether this would change the way or even the kind of research that was done was less clear: after all, it was said, carpenters may have put down their hand tools for electric ones, but their work largely remains much the same. Still, there was a sense that this new medium could affect the way that Humanities scholars go about their business, particularly in processing data or allowing access to raw data in a much more transparent manner. Funding agencies too had an important role by requiring sustainability, reusability and compliance with standards.

Second, it was recognised that the wheel should not have to be reinvented continually, meaning that there had to be better joined-up thinking across the pan-European institutions to ensure that academics working across disciplinary boundaries could learn from each other. As well as tools, methods and practice, data also needed to be shared, rather than being stored in ‘data silos’. The challenge, then, is to find ways of linking datasets. One solution proposed would be to embed metadata to provide a common ontology for each and every digital object, which could be recognised as generic and re-usable. But above all the emphasis was on making the data, tools, methods and practice accessible and open.

Lastly, it was felt that the digital medium presented an ideal opportunity to appeal to a much broader constituency beyond a narrow, single-discipline academic circle. It would not only be a case of creating tools and methods, or presenting data, that are easy for all scholars to adopt and use; it would also be possible, and desirable, to develop means of communicating the latest cutting-edge research to the general public. In fact, Humanities scholars, like their better-known colleagues from the Sciences, could even play a role in shaping educational and social policy. It is certainly true that computer scientists are keen to work with us, for they recognise that the kinds of questions we typically ask of data have the potential to extend the latest computing technology into exciting new areas.

As a Humanities scholar who has been part of an interdisciplinary team for the past two years, I can readily testify to the challenges that such work presents but, especially, to its benefits and excitement. Work has never been so much fun.


Taking a GAP year

Google has so far digitized over 12 million books in over 300 languages, much of which material was previously available only in prestigious university libraries. The amount of data now available is enormous, which is very exciting and holds huge potential for us as researchers but, frankly, is quite bewildering in scope. What’s there? And how can it be used? In a call that went out in April of this year (2010), Google threw down the gauntlet to the academic community to come up with some suggestions.
That’s where the GAP (or Google Ancient Places) project comes in…

As a team of experts drawn from the fields of Classical Studies, Archaeology and Computing, we―that is, Elton Barker (The Open University), Leif Isaksen (Southampton) and Eric Kansa (Berkeley)―aim to address these two primary concerns, the what and the how: first by pioneering a search facility for discovering data of general interest to humanities scholars (in this case, locations associated with the ancient world), and then by experimenting with ways of visualizing the results.
So with GAP you’ll be able to discover all references to a particular ancient location and then visualize the results in GoogleEarth, gaining a unique snapshot of the geographic spread of the references. Or you’ll be able to discover all the ancient locations mentioned in a specific book and visualize them in GoogleMaps, as and when they are mentioned, alongside the actual text. In the former case you know about the place and want to find the books; in the latter you have the book and want to find out about the places. Moreover, you’ll be able to do this either as a scholar whose research has a historical or geographical basis, or as a member of the public visiting, for instance, an ancient location and wanting to download information related to it on your iPhone―a case of literally putting knowledge into people’s hands…

The important thing is that this information is now available to all―and that is tremendously exciting as well as testing. Not only are digital resources transforming dissemination practices (in, for example, how scholars/experts communicate their message to a broader public); no doubt they will also change the way that we―both as researchers and members of the public―do things.

We don’t yet know where this GAP year will take us, but thanks to the challenge that has been set us, we have the chance to start shaping research practice and help bring the knowledge derived from it out of ‘ivory tower’ institutions into everybody’s homes.
