Peter Arnds, Jennifer Edmond, Alexander O'Connor


Submitted Entry for 2014 Competition


Abstract

Words on a page do not have a single meaning for the reader. Each lexical term causes the brain to bloom with associations and subtly influences how the reader views the text as a whole. The author and the reader cannot be assumed to share the same semantic framework, and neither can two different readers. A common challenge in the interpretation of literature is that cross-disciplinary, connective work is hindered by the differing frames of reference of the disciplines involved.

This work builds on the established value of topic modelling of digitised texts to examine how a reader's previous reading might influence their view of the terms they encounter in new texts. The work hinges on the notion that a computational 'blank slate' can be trained to be an extremely primitive analogue for a specific type of reader: How does the semantic field of a topic set derived only from poetry compare to that of one derived only from novels? How will each represent the associations of a word such as 'wolf'? The work is both technical and humanistic, choosing specifically to look at the interpretation of animal metaphors in text, and demonstrates the technical value of topic modelling in revealing hidden value in documents.

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

Is it possible to simulate the effects of different scholastic backgrounds on how semantic fields are represented in the mind of the reader using topic modelling and curated corpora?

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

This work will make extensive use of the variety of texts within the 19th C digital collection of the British Library. The project takes a unique approach to blending technical and humanistic inquiry to reveal hidden knowledge about texts, the scholarly process and how culture affects the reader. It is a showcase for the richness of the document collections, the incredible power of offering digitised text online and the value of humanistic interpretation twinned with technology. The impact of the work is also to provide a demonstration of how tools like topic mapping can be used to communicate subtle concepts to a broad audience, and show the benefits of digitisation in interpretation.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved


Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

Topic modelling of the BL’s digital collection, examining the appearance of myth-related lexicon across a given time frame and across different disciplines.

Scan the digital collection, which comprises books of the 19th century in literature, geography, history and philosophy, and create case studies to demonstrate topic modelling that is diachronic and thematic:

Take several thousand texts from the 19th century and scan for terms that relate to myth (such as wolf, forest, Dionysus, Apollo, metamorphosis, etc.); then trace myth-specific terminology over time and across academic disciplines. As for time and literary genre: in literature, for example, the bourgeois age of the nineteenth century, especially the realist period of its second half, typically distanced itself from a vocabulary top-heavy in myth and metaphor. Can a digital mapping of terms that are pertinent to a genre such as mythical or magical realism generate semantic fields that vary from one genre to another, from the picaresque to the Bildungsroman, and from drama to poetry to prose?
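The diachronic tracing described above could begin with a simple count of myth terms per decade and discipline. The following is a minimal sketch under stated assumptions: the `(year, discipline, tokens)` record layout, the function name and the sample sentences are invented for illustration and do not reflect the British Library's data format.

```python
from collections import defaultdict

# Illustrative lexicon, taken from the example terms named in the text.
MYTH_TERMS = {"wolf", "forest", "dionysus", "apollo", "metamorphosis"}

def myth_rate_by_group(records):
    """Return {(decade, discipline): share of tokens that are myth terms}.

    `records` is an iterable of (year, discipline, tokens) tuples.
    This layout is an assumption made for the sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, discipline, tokens in records:
        key = (year // 10 * 10, discipline)  # e.g. 1812 -> 1810
        totals[key] += len(tokens)
        hits[key] += sum(1 for t in tokens if t.lower() in MYTH_TERMS)
    return {k: hits[k] / totals[k] for k in totals if totals[k]}

# Invented sample records, not real collection data.
sample = [
    (1812, "literature", "the wolf crept through the forest".split()),
    (1818, "literature", "apollo sang of metamorphosis".split()),
    (1874, "literature", "the clerk counted his wages".split()),
    (1876, "history", "the tyrant was called a wolf".split()),
]
rates = myth_rate_by_group(sample)
```

Normalising by the total token count of each group keeps decades or disciplines with more surviving text from dominating the comparison.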

A relevant step in such topic modelling would be to scan the digital collection for specific terms, such as ‘wolf’, to see in which semantic fields such a myth-specific term appears.
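One simple way to approximate the semantic field around a term, before any topic model is trained, is to count the words that occur near it. The sketch below assumes pre-tokenised text; the function name, the window size and the two sample sentences are hypothetical.

```python
from collections import Counter

def context_counts(tokens, target, window=3):
    """Count the words found within `window` tokens of each `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            # Neighbours on both sides, excluding the target itself.
            neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sentences standing in for a poetic and an agricultural source.
poem = "the grey wolf howled beneath the pale moon".split()
farm = "the wolf killed three sheep and the farmer shot the wolf".split()

poem_field = context_counts(poem, "wolf", window=2)
farm_field = context_counts(farm, "wolf", window=2)
```

Comparing `poem_field` with `farm_field` gives a first, crude impression of how the same term keeps different company in different kinds of text.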

A metaphor such as the wolf (or the forest) is read differently by different groups of people and scholars. How do feminists read it compared to philosophers, political scientists and historians, and how do academically untrained people read it? In mythology and literature the wolf is typically associated with roguishness, the picaresque, religious demonisation, etc. Historians associate him with the tyrant. Feminists see the wolf as a typically male violator of female vulnerability (the Red Riding Hood tale has a lot to do with this and has undergone countless feminist rewritings, such as that by Angela Carter). People untrained in any academic field (such as farmers) may simply see the wolf as a trespasser, an aggressor, an animal that kills livestock. No doubt the mythological perception of wolves has, through the ages, been shaped by real-life experience, so it may ultimately not be possible to draw crystal-clear lines between the perspectives of farmers and those of mythologists or historians with regard to this animal and the associations it evokes. But a digital mapping of the semantic/lexical field surrounding a term such as wolf in the four different 19th century collections of the BL would no doubt reveal the different modes of interpretation surrounding terms that, in my own research, I am primarily interested in from a mythological perspective.

Sub-corpus selection has been shown to be of interest (http://www.sciencedirect.com/science/article/pii/S0304422X13000648), bringing previously under-appreciated documents to greater prominence. The use of topic modelling in literary analysis is extensive (e.g. http://www.sciencedirect.com/science/article/pii/S0304422X13000673), but this work is novel in that it examines the interpretative consequences that follow from selecting different initial corpora for the modelling. This work will build a framework for comparing different interpretive frames based on different corpus selections, by looking at the resulting topic compositions and their effect on document classification. It will leverage best-of-breed existing open source libraries.
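As an illustration of what comparing topic compositions across corpus selections might involve, the sketch below aligns the topics of two models by the overlap of their top terms. Jaccard similarity is only one possible measure and is an assumption here, not the method the project commits to; the topic lists are invented stand-ins for real model output.

```python
def jaccard(a, b):
    """Share of terms common to both lists, relative to all terms in either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_matches(topics_a, topics_b):
    """For each topic in A, return (index of closest topic in B, score)."""
    matches = []
    for terms_a in topics_a:
        scores = [jaccard(terms_a, terms_b) for terms_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Invented top-term lists for topics from two hypothetical sub-corpora.
poetry_topics = [["wolf", "moon", "night", "howl"],
                 ["sea", "ship", "storm", "sail"]]
novel_topics = [["ship", "captain", "sea", "storm"],
                ["wolf", "sheep", "farm", "night"]]

matches = best_matches(poetry_topics, novel_topics)
```

A low best-match score for the topic containing 'wolf' would suggest that the two simulated readers place the term in noticeably different semantic fields.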

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

The Trinity College-based team is an active and established node, working together within the context of the Marie Curie International Research Staff Exchange Scheme project "Social Performance, Cultural Trauma and Reestablishing Solid Sovereignties" (SPECTRESS). The three team members have complementary skills in literary analysis, digital humanities and curation, and computer science respectively, as follows:

Professor Peter Arnds is the Director of the MPhil in Comparative Literature and the MPhil in Literary Translation and a Fellow at Trinity College Dublin, where he also teaches in the German and Italian departments. His interests in literature and cultural studies are widespread, from Sophie von La Roche to W.G. Sebald, post-Holocaust literature, magical realism, the satirical visual arts, the wolf-man, travel literature, and translation theory. His publications include two monographs, about 50 peer-reviewed articles on literary criticism, as well as numerous short stories and poems. He is currently working on a book entitled Myth Matters: Trauma, Genocide, World Literature, a project that has evolved out of his monograph “Abandoned: The Wolf Man in German Literature from the Middle Ages to the Holocaust” (forthcoming).

Dr Jennifer Edmond is an established figure in the field of European digital humanities and research infrastructures. She is coordinator of the €6.5 M Collaborative European Digital Archival Research Infrastructure (CENDARI) Project and a central contributor to Europeana Cloud, the Network for Digital Methods in the Arts and Humanities (NeDiMAH) and COST IST 1005 Medieval Resources and New Technologies. She is widely published in prominent journals on issues of digital infrastructure, collaboration and the application of digital humanities methodologies to the cultural heritage sphere, in particular in archives and libraries.

Dr. Alexander O’Connor is a highly experienced Research Fellow with a proven track record, through his 10 years of research on CNGL (Challenge Leader), FP7 (CULTURA, CENDARI) and SFI TIDA Emizar. He is a lecturer in Networks and Distributed Systems at the MSc level in TCD, specialising in Network Management. He has published 3 journal articles and 29 conference papers with an h-index score of 6. He has co-supervised 4 PhDs and supervised 3 MSc students. He has successfully collaborated with industrial research teams and in several infrastructure initiatives, including EUROPEANA and DARIAH. A partial list of publications is available at http://www.dblp.org/pers/hd/o/O=Connor:Alexander.html

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical
The technical approach will leverage existing well-proven libraries for topic analysis. Specifically, the work will begin with the gensim topic modelling library for Python, using data cleaned via both the British Library's own tools and the Python NLTK library. The LDA algorithm will be trialled first, but Topic Vectors are also available in this library. If the corpus grows beyond the data capacity of this setup, Apache Mahout will be deployed as needed. The presentation of the results will use D3.js, a Python Flask micro-server and modern HTML/JS development. An agile, iterative approach will be adopted for development.
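The cleaning step that precedes topic modelling can be sketched with the standard library alone. This is a simplified stand-in, not the project's pipeline: the stopword list is a tiny illustrative sample (NLTK's English list would replace it), and the helper names are hypothetical. In practice gensim's `corpora.Dictionary` and its `doc2bow` method cover the vocabulary and bag-of-words steps.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption for this sketch only.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "is"}

def clean(text):
    """Lower-case the text, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    return sorted(Counter(vocab[t] for t in doc).items())

# Invented sample sentences, not collection data.
raw = ["The wolf and the forest.", "A wolf is a wolf."]
docs = [clean(r) for r in raw]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
```

The sparse `(token_id, count)` form is the kind of input that LDA implementations typically consume, so a cleaned corpus in this shape can be handed to the modelling stage directly.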

Curatorial
Although we do not yet know the exact nature of the full range of texts that will be available to us for this investigation, the questions are generic enough that we are confident of being able to compile a suitably rich and large corpus from the BL digital holdings. The exact nature of what we include may slightly change the frame of reference, but not the validity of the questions or methodology.

Legal
The document corpora will be restricted to items which have an open or otherwise derivable license. Particular attention will be paid to the specific attribution, re-sharing and modification permissions for the texts.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between May 26th - Oct 31st, 2014.

May 26 2014 onwards
Initial Planning, Corpus exploration, Iteration Definition

June 2014
Initial implementation of first topic model, testing of robustness of parameter selection and model representativeness

July 2014
Implementation of further topic models, introduction of new text classification

August 2014
Development and testing of public-access web environment to tell the digital story of metaphor and scholastic pre-conception

September 2014
Corpus expansion and diversification, dissemination to other researchers in CS and Humanities via CNGL, CENDARI and other projects
Possible invitation of other themes

October 2014
Final delivery, testing and packaging of the software, publication