Bob Shoemaker (Professor of Eighteenth-Century British History, University of Sheffield), one of the primary academics behind the Old Bailey on-line archive (http://www.oldbaileyonline.org/), recently came to address the Open University’s Digital Humanities seminar (22 November 2010). Set up with an AHRC grant and lottery funding, and now running for some eight years, the Old Bailey project has so far involved the manual keying-in of documents (from the Proceedings and Ordinary’s Accounts) and their ‘mark-up’ in XML to provide access to key bits of information (crime, date, location, defendant, victim, judges, etc.). But Bob hadn’t come to talk to us about how great the website was, though it has won or been nominated for several awards. Instead, he wanted to talk about their plans to address two limitations they had identified: the site’s lack of links to other resources, and its inability to exploit the data that it does hold. Both points have relevance for Digital Humanities and, more specifically, for the GAP project.
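To make the mark-up idea concrete, here is a minimal sketch of how a marked-up trial record might be queried. The element and attribute names below are invented for illustration only; the project’s actual schema differs, and the record itself is made up.

```python
# Hypothetical, much-simplified sketch of querying a marked-up trial
# record. The tag names and the record itself are invented for
# illustration; the real Old Bailey mark-up is far richer.
import xml.etree.ElementTree as ET

record = """
<trial date="1753-04-25">
  <defendant>Mary Smith</defendant>
  <victim>John Carter</victim>
  <offence category="theft">grand larceny</offence>
  <verdict>guilty</verdict>
</trial>
"""

root = ET.fromstring(record)
print(root.get("date"))                      # 1753-04-25
print(root.findtext("defendant"))            # Mary Smith
print(root.find("offence").get("category"))  # theft
```

Once every record carries tags like these, the archive can be searched by offence category, verdict, or date rather than by raw text alone.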
1. Linked Data
His initial response to the lack of links had been to add new datasets that would supplement the Old Bailey records, such as parish archives and criminal records. But this centralizing tendency came at a cost in both time and labour. Instead, his team hit upon a federated model: not a website that would house all the data itself, but a portal that, using a federated search facility, could point users to the original websites specializing in the relevant data. One such portal that Bob has been involved in developing is Connected Histories (http://www.connectedhistories.org), due to launch in March 2011, which will facilitate the discovery of a wide range of distributed digital resources relating to early modern and nineteenth-century British history.
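The federated idea can be sketched in a few lines: the portal holds no records itself, but fans a query out to several independent sources and merges the pointers it gets back. The two source functions below are stand-ins I have invented for what would really be calls to remote search APIs.

```python
# Sketch of a federated search: query every source in parallel and
# merge the results. The source functions are invented stand-ins for
# real remote APIs; the portal returns links, not copies of the data.
from concurrent.futures import ThreadPoolExecutor

def search_old_bailey(query):
    # stand-in for a call to the Old Bailey search service
    return [{"source": "Old Bailey", "title": f"Trial mentioning {query}",
             "url": "http://www.oldbaileyonline.org/"}]

def search_parish_records(query):
    # stand-in for a call to a parish-records database
    return [{"source": "Parish records", "title": f"Register entry for {query}",
             "url": "http://example.org/parish"}]

SOURCES = [search_old_bailey, search_parish_records]

def federated_search(query):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda fn: fn(query), SOURCES))
    # flatten the per-source lists into one set of pointers
    return [hit for hits in result_lists for hit in hits]

for hit in federated_search("Mary Smith"):
    print(hit["source"], "->", hit["url"])
```

The design point is that each source keeps curating its own data; the portal only needs to know how to ask each one a question and display the answers side by side.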
But, as Bob explained, a major obstacle to research remains even with this more devolved search facility: the kinds of searches that can be done (by name, place, date, keyword, etc.) are still limited by being predetermined. What was needed, he argued, was a way of approaching the data without preconceptions and assumptions.
2. Text Mining
The answer, according to Bob, was to make use of data mining tools that could extract meaningful patterns from masses of data, which could then be analysed. Keyword searches tend to produce too many results; or else the ranking of results can be open to question. (Google, for example, ranks search results according to popularity: fine if you’re a consumer looking for a product, but less well suited to the academic researcher, who, if anything, is looking for the data that is less well known and poorly excavated.) Keyword searches, moreover, have been talked about in terms of looking for needles in haystacks: of potentially more use to the scholar would be a search facility that could point to the shape and size of the haystack itself…
This is work-in-progress for all concerned (Bob pointed us to another Old Bailey spin-off, http://www.criminalintent.org, which heralds the beginning of ‘drilling down into the data’): but Bob gave a useful run-down of the tools that could assist the Digital Humanities researcher in making more sensitive enquiries by approaching the data without preconceived questions. Three he mentioned are:
i) Zotero (http://www.zotero.org/): a citation management tool that maps word frequencies to create a cloud (wordle), which highlights prominent themes;
ii) TAPoR (http://portal.tapor.ca/portal/portal): the Text Analysis Portal for Research maps word usage over time, including peaks (or ‘trends’), density, collocations, and types (unique words);
iii) Compression Analysis: this tool measures degrees of similarity between texts based on repetition of word patterns (a ‘more like this’ function), and learns from experience…
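The third of these can be illustrated briefly. A standard way of measuring similarity through compression is the normalized compression distance (NCD): if compressing two texts together saves space compared with compressing them separately, they must share repeated patterns. Whether the tool Bob described works exactly this way is my assumption; the sketch below simply shows the principle, using invented snippets of text.

```python
# Sketch of compression-based similarity (a 'more like this' measure).
# This implements the normalized compression distance (NCD); the texts
# are invented examples. Lower values = more similar.
import zlib

def compressed_size(s: str) -> int:
    return len(zlib.compress(s.encode("utf-8"), 9))

def ncd(x: str, y: str) -> float:
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "the prisoner was indicted for stealing a silver watch " * 5
b = "the prisoner was indicted for stealing a linen handkerchief " * 5
d = "rainfall totals for the county of Kent in the year 1850 " * 5

# The two trial-style texts share far more word patterns with each
# other than either does with the unrelated text.
print(ncd(a, b), ncd(a, d))
```

Run on these samples, the distance between the two trial-style texts comes out smaller than the distance between a trial text and the unrelated one, which is exactly the ‘shape of the haystack’ kind of signal a keyword search cannot give.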
Bob’s talk has given me much food for thought. The two areas he identified as the next stage for his own project, linked data and text mining, are precisely the ones I can see being most relevant to the kind of work that we at GAP would like to do with ancient places: linking the ancient places in Herodotus’ Histories to other datasets (whether other ancient textual sources, secondary scholarship, or even artefacts); and finding out more about the citation patterns for ancient locations in Herodotus (and other authors), such as their collocations with other places/nouns or the verbs that connect them. But that’s for another blog…