The GAP project has generated data linking references to places in public domain books (from Google Books and the Open Library) to place-entities published by the Pleiades Gazetteer. We have focused specifically on public domain literature so that we could conduct our analyses and release our results with no strings attached. In releasing these results (see here), our team discussed some of the intellectual property issues associated with extracting information from large corpora of texts. We believe these issues have important policy implications, especially with regard to who can own and discover “facts” about books.
Facts have an important place in copyright because they can’t be copyright protected. Nobody can copyright the speed of light, the height of the Statue of Liberty, or the number of proper nouns found in a given text. It is good scholarly practice to cite sources for such factual information, but it is not a legal requirement. Even the number of times “Rome” is mentioned in one text versus another is largely a factual matter, especially if one is explicit about analytic techniques, toponyms, and the like. Explicit methods, automated through software, reinforce the factual nature of our data (with regard to copyright). No doubt some of our data are wrong, but these errors can be considered errors of measurement or methodological limitations. Another group would get exactly the same results with the same inputs and the same software. As is appropriate given the “factual” nature of our data, we waive all copyright claims to these results using the Creative Commons Zero dedication.
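To make the reproducibility point concrete, here is a minimal sketch of counting place-name mentions in a text. It is not the GeoParser, which is far more sophisticated; the function name `count_toponyms`, the sample sentence, and the whole-word, case-sensitive matching rule are all assumptions made up for illustration. The point is only that once the method is stated explicitly, anyone running it on the same input gets the same counts.

```python
import re
from collections import Counter


def count_toponyms(text, toponyms):
    """Count whole-word, case-sensitive mentions of each toponym in text.

    Hypothetical helper for illustration only; real geoparsing must also
    handle name variants, ambiguity, and disambiguation against a gazetteer.
    """
    counts = Counter()
    for name in toponyms:
        # \b anchors the match to word boundaries so "Rome" does not
        # match inside a longer word such as "Romeo".
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return dict(counts)


# Made-up sample text, not drawn from the GAP corpus.
sample = "Rome was not built in a day. From Athens to Rome, all roads lead to Rome."
result = count_toponyms(sample, ["Rome", "Athens", "Sparta"])
print(result)
```

Because the matching rule is explicit, any disagreement about the resulting numbers becomes a disagreement about method (for example, whether to match name variants), not about authorship.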
The above discussion is not meant to belittle or discount our effort. After all, we put a great deal of thought and work into preparing the gazetteer data, building the GeoParser, and analyzing and comparing data developed by the Hestia Project with data generated by the GeoParser. We do not want to trivialize the effort and judgement required to prepare automated processes to identify places in texts. At the same time, however, we think that these IP issues have important ramifications for the future of scholarship. Information automatically extracted from texts will likely play an important role in shaping future understandings of scholarly literature. Data mined from texts are and will be the subjects of scholarly analyses and interpretations. Yet much scholarly literature is under copyright and held by organizations that severely restrict access (especially many academic publishers). As text mining approaches gain traction, software-generated summaries and other information extracted from large text corpora may be guarded as proprietary intellectual property by the organizations that happen to control those corpora now.
If data mined from large corpora of scholarly texts are proprietary, and if the owners of these corpora do not allow outside researchers to conduct independent analyses of them, then information critical to understanding in many areas of scholarship will be locked down. The owners of large corpora would have a powerful, almost incontestable, and perhaps unaccountable means to shape research agendas and understanding. Thus, we see IP issues as a key area of concern in the emerging world of large-scale textual analysis of research corpora.
We’re really happy that representatives of Google raised no objection to the public domain dedication of our data, and we hope to see more such public domain data from more works and more corpora, from Google Books as well as other sources, including academic journals. Hopefully this discussion and our own small release of public domain data will help raise awareness of these critical concerns.