Combining Text Analysis and Geographic Information Systems to investigate the representation of disease in nineteenth-century newspapers. (Category: Research)

Name of Submitter(s): Catherine Porter, Ian Gregory, Paul Atkinson
Name of Team: Spatial Humanities: Texts, GIS & Places
Organisation: Lancaster University

Although part of a larger project, this particular piece of research focuses on the discussion of disease in nineteenth-century newspaper media. The case study chosen is a London-based newspaper, the Era, which has been digitised and made available by the British Library. The digitised corpus (1838-1900, comprising over 377 million words) is explored using a varied selection of qualitative and quantitative techniques to determine how the Era discussed and portrayed disease, both temporally and spatially.
Two aspects of the Era’s content are of particular interest: (I) how did the media acknowledge and discuss disease and mortality in time and space, and (II) how did media interest in disease correlate with deaths from disease?
These questions are tackled on a temporal and spatial basis by building on previous work conducted by the project ‘Spatial Humanities: texts, GIS, places’, based in the Department of History at Lancaster University. The crux of the methodology combines Corpus Linguistic analysis with Geographic Information Systems (GIS) in a new and groundbreaking technique named Geographical Text Analysis (GTA).
The investigation is based on three key disease groups and initially each of these is explored in terms of the frequency with which they are mentioned in the text. Next, collocation analysis is used to decipher which words co-occur with the disease ‘key words’, and the concordances are considered. Further enquiry is then conducted to assess the correspondence of the newspaper’s mention of disease with mortality statistics extracted from the nineteenth-century Registrar-General’s decennial reports. Lastly, GTA is employed to analyse the geography of disease discussion in the text (locally, nationally and globally).
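The frequency and collocation steps described above can be illustrated with a minimal sketch. It is not the project's own tooling: the tokenizer, the function names (`tokenize`, `keyword_frequency`, `collocates`) and the window size are assumptions chosen for illustration, and the raw co-occurrence counts shown here stand in for whatever association measure a corpus linguistics package would normally apply.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_frequency(tokens, keywords):
    """Return how often each disease keyword occurs in the token list."""
    counts = Counter(t for t in tokens if t in keywords)
    return {k: counts[k] for k in keywords}


def collocates(tokens, keywords, window=5):
    """Count the words found within `window` tokens either side of a keyword.

    This is a raw co-occurrence count (an assumption for this sketch);
    it does not compute a statistical association score.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in keywords:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    sample = "Cholera has appeared in Liverpool. The cholera deaths in London rose."
    toks = tokenize(sample)
    print(keyword_frequency(toks, ["cholera", "typhus"]))
    print(collocates(toks, {"cholera"}, window=2).most_common(5))
```

In practice the collocate list would be filtered (for example by removing function words) and ranked before the concordance lines are read closely.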
This new methodology combines the more traditional close reading of historic texts with distant reading. By combining techniques from corpus linguistics, history and geography, this research provides, for the first time, insight into how the newspaper media acknowledged, commented on and indeed portrayed disease in the nineteenth century, as well as where they focused their interests geographically.
‘Spatial Humanities: Texts, GIS, Places’ is a five-year European Research Council (ERC) funded project that aims to create a step-change in how place, space and geography are explored within the humanities. The project is currently made up of the PI, Professor Ian Gregory, and two Research Associates, Dr Catherine Porter and Dr Paul Atkinson. After spending a number of years developing new methodologies, we are now applying techniques such as Geographical Text Analysis (GTA) to a broad spectrum of large digital corpora, in this particular case nineteenth-century newspapers sourced from the British Library.

Problem / Challenge Space:

Much has been published by historians and demographers on health and mortality in nineteenth-century England and Wales. However, existing research has failed to highlight the role played by newspaper media. Considering the prevalence of disease in nineteenth-century Britain, which statistics show drastically affected public health, the media is likely to have played a crucial role in reporting and disseminating information on health and disease to the population.
The investigation of newspaper media, often using various forms of corpus linguistic analysis, is an established and advancing field, with research covering topics ranging from racism to gender. Some research has highlighted the role played by nineteenth-century newspapers in discussing and publicising health and disease, but these are largely case studies based on countries other than Britain. Additionally, the majority of research into health discourse in the media has tended to focus on twentieth and twenty-first century newspaper publications in countries such as China, South Africa and New Zealand, perhaps because, until recently, few early newspapers were available in digitised format. In fact, no single study has investigated how nineteenth-century British newspapers discussed and portrayed public health and disease both locally and abroad.
Given the existing literature, and with the advent of new tools and methods that facilitate a combination of close and distant reading, there is a clear gap that this research strives to fill.
As such, two aspects of the Era’s content are of particular interest: (I) how did the media acknowledge and discuss disease and mortality in time and space, and (II) how did media interest in disease correlate with deaths from disease?
Through these questions we are able, for the first time, not only to explore how disease was portrayed in British nineteenth-century newspaper media but also to provide insight into the geography of this discussion. So far, the analysis of disease has provided us with evidence of specific columns in the newspaper that regurgitated the Registrar-General's weekly health reports, as well as the inclusion of patent medicine advertisements for products often known as 'cure-alls'. The geography of disease both at home and abroad has also pointed to the places that were of greatest interest to the newspaper, as well as the importance of Britain as a colonial and trading nation during this time.

Approach / Methodology:

This project introduces a new methodology for the investigation of large digital corpora, specifically historic texts, such as nineteenth-century newspapers. It uses a mixture of qualitative and quantitative methodologies, and is the first of its kind to combine corpus linguistic techniques such as frequency and collocation analysis with the investigation of geography via Geographic Information Systems and various statistical techniques.
In brief, the analysis of the text is based on three key disease groups (Woods, 2000): Crowding, Food and Water borne, and Respiration, each of which corresponds with prominent causes of death in the nineteenth century. Initially, each of these groups is explored in terms of the frequency with which it is mentioned in the text. Collocations of the key disease groups are also investigated, and following this, further analysis is conducted to assess the correspondence of the newspaper’s mention of disease with mortality statistics. Lastly, Geographical Text Analysis (GTA) is employed to analyse the geography of disease discussion in the corpus (locally, nationally and globally). This process geoparses the newspaper text, identifying and assigning coordinates to 'places' that collocate with disease key words. These places are then mapped and analysed using GIS software and various statistical techniques such as density smoothing and Kulldorff's spatial scan statistic.
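The geoparsing step described above can be sketched as follows. This is an illustrative toy, not the project's geoparser: the three-entry `GAZETTEER`, its approximate coordinates, the `DISEASE_TERMS` set, the window size and the function names are all hypothetical. A real geoparser must also handle multi-word place names and disambiguate between places that share a name, which this sketch does not attempt.

```python
import re
from collections import Counter

# Hypothetical miniature gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "london": (51.51, -0.13),
    "liverpool": (53.41, -2.98),
    "calcutta": (22.57, 88.36),
}

# Hypothetical disease key words used for the example.
DISEASE_TERMS = {"cholera", "typhus", "smallpox"}


def tokenize(text):
    """Lower-case the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def place_disease_cooccurrences(text, window=10):
    """Count (place, disease) pairs where a gazetteer place name has a
    disease key word within `window` tokens on either side."""
    tokens = tokenize(text)
    hits = Counter()
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            lo = max(0, i - window)
            context = tokens[lo:i] + tokens[i + 1:i + window + 1]
            for disease in DISEASE_TERMS.intersection(context):
                hits[(tok, disease)] += 1
    return hits


def to_points(hits):
    """Turn the counts into (lat, lon, place, disease, count) rows that
    could be exported to GIS software as a point layer."""
    return sorted(GAZETTEER[p] + (p, d, n) for (p, d), n in hits.items())


if __name__ == "__main__":
    sample = ("Cholera has broken out in Calcutta. In London the weather "
              "was fine and the theatres were full all week.")
    for row in to_points(place_disease_cooccurrences(sample, window=5)):
        print(row)
```

The resulting rows are the kind of point data on which density smoothing or a spatial scan statistic could then be run inside a GIS or statistical package.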
The overall methodology used in this project differs from any seen before, and as such, provides outputs that are not achievable via more traditional close-reading techniques.

Extent of showcasing BL Digital Content:

The digitised newspaper text sourced from the British Library is the crux of this research project. The digitised editions of the Era (1838-1900) include over 377 million words of text and form the basis of the temporal and spatial analysis of disease representation in newspaper corpora throughout the nineteenth century. Without this digital product such a project would not be possible, as more traditional methods of close reading, especially from the original newspaper text, would not reveal the intricacies of the disease portrayal being researched, especially in terms of geography.
The other digital content used in this project is the Registrar-General's decennial statistics. These are sourced from the HistPop collection and the Great Britain Historical GIS (GBHGIS). These data are compared qualitatively and quantitatively with the data extracted from the text of the Era.
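One simple way to make the quantitative comparison concrete is sketched below. The source does not say which statistic the project uses, so the choice of a Pearson correlation between normalised mention rates and death counts is an assumption, as are the function names `per_million` and `pearson`. The numbers in the usage example are made-up placeholders, not results from the Era or from the Registrar-General's reports.

```python
import math


def per_million(mentions, words):
    """Normalise raw mention counts to mentions per million words,
    so that decades with more text are not over-weighted."""
    return [m * 1_000_000 / w for m, w in zip(mentions, words)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Placeholder values only: one entry per decade.
    mentions = [10, 20, 30, 40]
    words = [1_000_000, 1_000_000, 2_000_000, 2_000_000]
    deaths = [100, 180, 160, 210]
    rates = per_million(mentions, words)
    print(rates, pearson(rates, deaths))
```

With only a handful of decennial data points any such coefficient should be read cautiously, alongside the qualitative comparison.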

Impact of Project:

Although this is a relatively new piece of research (started earlier this summer), it is the first time that the groundbreaking technique named Geographical Text Analysis (GTA) has been applied to historic newspapers. Currently the research is being disseminated at conferences (for instance, at "Making ‘Big Data’ Human: Doing History in a Digital Age", University of Cambridge, 09/09/2015) and has been well received, with many participants interested in learning how they may apply this methodology to their own research. A full paper is currently being prepared for publication in a renowned peer-reviewed journal; this will provide impact and outreach for the GTA methodology, as well as publicising the British Library digital collections to a worldwide audience.
We also see this piece of research as a template for further research into nineteenth-century newspapers as well as other texts, as the combination of techniques provides a new and important methodology for the assessment of historic texts in the humanities. For instance, we envisage an extension of this project being the assessment of further newspaper corpora from other parts of Britain (for instance, rural areas) and indeed other countries and time periods. Overall, the methodology we have developed will change how we assess large digital corpora and provide insights into texts that have not been possible until now.

Issues / Challenges faced during project(s):

The main challenge has been the development of the methodologies for this project. From the outset we have been keen to provide new avenues through which to explore historic datasets, and particularly keen to work at the crossover between qualitative and quantitative methods in the digital humanities. With time and testing we have now achieved a fully working technique (Geographical Text Analysis) that is already producing publishable results.
Access to digitised textual data (primarily historic corpora) has also been difficult, and in some cases expensive. Thanks to institutions such as the British Library we have been given access to some wonderful resources, allowing us not only to test the methodologies but also to provide meaningful outputs and contributions to the fields of corpus linguistics, history and geography.