Ryan Cordell, David Smith, Abby Mullen, Jonathan Fitzgerald

Submitted Entry for 2015 Competition

Abstract

What do the millions of digitized newspaper pages now online actually contain? We can search for specific topics, names, words, or phrases across huge swaths of time and space, but the vastness and diversity of historical newspaper content still daunts exploratory research. While keyword search enables us to assemble evidence in a very targeted way, it can also foreclose serendipitous discovery of related material. The digitized archive can hide as much as it reveals.

These problems are particularly acute for nineteenth-century materials. During this period newspapers were an all-purpose medium, comprising familiar genres such as news and commentary, literary genres such as fiction and poetry, and a host of occasional genres such as recipes, squibs, vignettes, jokes, sermons, advice columns, and more. How can we make sense of this miscellany? Are there methods for exploring this content that could draw connections that simple search would obscure or even foreclose? Can we help scholars, students, and members of the public identify interesting materials they do not already know about before the research process begins?

Mining Miscellany will explore alternative methods for organizing and accessing digitized newspaper content. First, we will identify and group reprinted texts across the British Library’s Historical Newspapers and the Library of Congress’ Chronicling America newspaper collections. We will then develop methods for automatically categorizing those reprints by genre and topic, sorting them into groups useful for exploratory research. These genres will be evaluated both by our team and through a web interface that will allow the public to investigate the kinds of texts that were popular in their geographic location at different moments in the nineteenth century, and to see how texts spread to the wider country and beyond. The verification data provided through these means will help us improve our sorting methods and develop metadata about genre and topic that we can contribute back to the British Library and—if they are also interested—the Library of Congress. In short, Mining Miscellany tackles one of the biggest challenges facing large-scale digital collections of all sorts: how to make sense of their contents beyond keyword search.

URL for project:
http://viraltexts.org

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

Mining Miscellany will develop methods for automatically sorting, by genre and topic, texts that circulated within and between nineteenth-century Britain and the United States. Nineteenth-century newspapers were an all-purpose medium, printing both informative and entertaining texts for readers. Digitized corpora make these materials more accessible through keyword search and—for those scholars with the necessary skills and/or resources—computational text analysis. Due to their scale and miscellaneous content, however, historical periodicals remain difficult to browse or research in exploratory ways. Most historical newspaper archives are only minimally annotated. The best historical newspaper datasets have segmented newspapers by article and added information about article titles or authors, though the frequency of untitled and anonymous texts in historical newspapers makes even this metadata unreliable as a discovery mechanism. Scholars cannot browse all the poems, recipes, or abolitionist texts in the British Library’s digitized newspapers, for instance, though such browsing (or filtering) could be immensely valuable, particularly at the beginning stages of research projects. Mining Miscellany will ask whether computational methods might help sort individual newspaper texts, at the article level, into categories meaningful for humanistic research. Can software reliably tell a sermon from a political speech, or fiction from war reporting?

In addition, Mining Miscellany will develop a web interface that will allow the public to browse popular transatlantic texts and identify their genre, which will in turn provide valuable verification data to help improve the project’s automated techniques. To make this a participatory activity that will interest the public, we propose to foreground content based on users’ geographic location, serving historical newspaper articles from, say, Oxford for users accessing the site from an Oxford IP address. Users will also be able to manually select a different location, search reprinted clusters, or browse clusters by genre. As users navigate the site, they will be asked questions about the pieces they view: e.g. “Does this content actually relate to the locale you searched for?” or “Choose the most appropriate genre for this content from the following list.” The geographic interface will offer a mode of serendipitous exploration unavailable in most archival interfaces, while users’ interactions with the site will provide verification data that will help us validate genres identified automatically through computational text analysis.
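As a rough sketch of how such location-aware serving might work, the following Python (Flask) endpoint is purely illustrative: the route, the geolocate() helper, and the cluster data are hypothetical placeholders rather than existing Viral Texts code.

    # Hypothetical sketch of a location-aware browsing endpoint.
    # geolocate() and CLUSTERS are invented placeholders, not project code.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Toy stand-in for the reprint-cluster store, keyed by place of publication.
    CLUSTERS = {
        "Oxford": [{"id": 42, "title": "[untitled poem]", "genre": None}],
    }

    def geolocate(ip_address):
        """Placeholder: map an IP address to a place name via a geolocation service."""
        return "Oxford"

    @app.route("/browse")
    def browse():
        # Users may override the location inferred from their IP address.
        place = request.args.get("place") or geolocate(request.remote_addr)
        return jsonify({
            "place": place,
            "clusters": CLUSTERS.get(place, []),
            # Each item is paired with a verification prompt for crowdsourced labels.
            "question": "Choose the most appropriate genre for this content.",
        })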

Finally, Mining Miscellany will bring the British Library’s Historical Newspapers into direct conversation with US newspapers, demonstrating the extent and character of transatlantic reprinting and information exchange during the nineteenth century. It is, for instance, a truism in literary studies that far more British content was exported to the United States than American content to England, but this project will allow us to test those ideas against a huge swathe of print from the period. This global investigation lends itself to natural experiments we are eager to undertake, such as comparing textual exchange between England and the US before and after the completion of the transatlantic telegraph cable in August of 1858. Our early work showed that articles announcing the cable’s first message, between Queen Victoria and President James Buchanan, were among the most widely reprinted pieces in the US during the century, but how did that technological feat change what kinds of texts were shared between the nations, the frequency of exchange, and the networks of papers now instantly connected across the Atlantic Ocean? Which genres resonated most strongly in both national and transatlantic contexts?

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections
The primary digital resource used by Mining Miscellany will be the 19th Century British Library Newspapers database. The project will offer a sophisticated example of the kinds of textual patterns detectable in large-scale digital archives. The resulting data will provide a different window into the collection (textual duplication) than is offered by search alone. By identifying duplicate text across the newspapers in the collection, Mining Miscellany will also create new links among the archive’s newspapers, highlighting shared content and uncovering networks of influence and sharing that are not apparent from the papers’ metadata alone.
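As an illustration of the kind of network such duplicate detection implies (the cluster data below is invented and networkx is used only for the sketch), newspapers become nodes and an edge’s weight counts the reprint clusters two papers share:

    # Illustrative sketch: build a sharing network from reprint clusters.
    # Cluster membership here is invented for the example.
    import networkx as nx

    clusters = {
        "cluster-1": ["The Times", "New-York Tribune", "Liverpool Mercury"],
        "cluster-2": ["The Times", "Liverpool Mercury"],
    }

    graph = nx.Graph()
    for members in clusters.values():
        for i, paper_a in enumerate(members):
            for paper_b in members[i + 1:]:
                weight = graph.get_edge_data(paper_a, paper_b, {}).get("weight", 0)
                graph.add_edge(paper_a, paper_b, weight=weight + 1)

    # Shared-content ties that are not visible in the papers' metadata alone.
    print(sorted(graph.edges(data=True)))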


In addition, the categorization methods developed in Mining Miscellany will provide a rich and valuable new facet for exploring the newspapers. The huge scale and diversity of newspaper texts have always made accessing their contents difficult, in analog or digital form. Hand annotation of such enormous corpora is equally impractical. By developing methods to automatically detect genres across such a corpus, Mining Miscellany will contribute back to the British Library a model for improving the browsability of its newspaper collections for scholars across a range of fields, adding serendipitous discovery of related texts to other discovery mechanisms. The interface for this aspect of the project will not only bring “viral” texts from the nineteenth century to public attention but will also help improve the granular metadata generated by the project, making it even more useful to the British Library over time.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved


Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

To identify the texts that have been duplicated across the Atlantic, we’ll be using natural language processing algorithms already written by David Smith and used heavily in the Viral Texts project. These algorithms are designed to detect pairs of newspaper issues with significant textual overlap, align these issue pairs, and cluster the pairwise alignments into larger families of reprints. We rely on high-precision features such as matching word n-grams and on pruning pairs of issues with small n-gram overlap (n-grams are contiguous sequences of n words; a 5-gram, then, is a sequence of 5 words). We use shingling techniques to align regions of text across large-scale archives that share many matching n-grams, even when the texts do not match from beginning to end, whether because of changes made by nineteenth-century editors or errors introduced by modern OCR. We describe our current methods in detail in papers at http://viraltexts.org/publications-and-press/
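The snippet below is not the Viral Texts code itself (our actual pipeline is described in the papers linked above); it is only a minimal Python sketch of the underlying shingling idea: index word n-grams, then keep issue pairs that share enough of them to be worth aligning.

    # Minimal sketch of n-gram shingling for candidate pair detection.
    from collections import defaultdict
    from itertools import combinations

    def ngrams(text, n=5):
        """Yield contiguous word n-grams (shingles) from a text."""
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

    def candidate_pairs(issues, n=5, min_shared=3):
        """Return pairs of issue ids sharing at least min_shared n-grams."""
        index = defaultdict(set)          # n-gram -> issue ids containing it
        for issue_id, text in issues.items():
            for gram in ngrams(text, n):
                index[gram].add(issue_id)

        shared = defaultdict(int)         # (id_a, id_b) -> shared n-gram count
        for ids in index.values():
            for a, b in combinations(sorted(ids), 2):
                shared[(a, b)] += 1

        return [pair for pair, count in shared.items() if count >= min_shared]

Pairs that survive this pruning step are then aligned and clustered into families of reprints.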



This project will introduce a new element: automatic classification of our reprinted text clusters into genres or topics. To classify the texts we discover, we will build statistical models, likely using the R programming language along with MALLET, a topic modeling toolkit. These classifications may be based on Naive Bayes classifiers, topic models, clustering models, or other approaches; we’ll be better able to choose among them once we have initial results. We will verify these results both as a team and through a web interface that allows for crowdsourced verification.
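As one hedged example of what such a classifier might look like (we plan to work in R and MALLET; scikit-learn and the tiny invented training set below are used purely for illustration):

    # Sketch of a Naive Bayes genre classifier over bags of words.
    # The training examples and labels are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "thou art gone to the grave but we will not deplore thee",
        "the senate yesterday debated the tariff bill at great length",
    ]
    train_genres = ["poetry", "news"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_genres)

    # Each reprint cluster could be represented by one witness and labeled in bulk.
    print(model.predict(["a bill to amend the tariff was reported from committee"]))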



To create the public-facing web interface, we will collaborate with the BL to create an attractive and user-friendly website that will serve both to invite users to interact with BL resources and to provide verification of our classifications.



Finally, we will use the verification data generated through the web interface to help us improve our sorting methods, verify geographic information, and generate metadata about each genre that we can contribute back to the British Library.
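A small, purely illustrative sketch of that feedback loop (the function names and data are invented): user votes on each cluster are reduced to a majority label, which can then be compared against, and used to retrain, the automatic classifier.

    # Illustrative helpers for folding crowdsourced verification back in.
    from collections import Counter

    def majority_label(votes):
        """Reduce several user-supplied genre votes for one cluster to a single label."""
        return Counter(votes).most_common(1)[0][0]

    def agreement(auto_labels, crowd_labels):
        """Share of clusters where the automatic genre matches the crowd's majority vote."""
        shared = set(auto_labels) & set(crowd_labels)
        hits = sum(auto_labels[c] == crowd_labels[c] for c in shared)
        return hits / len(shared) if shared else 0.0

    auto = {"cluster-1": "poetry", "cluster-2": "news"}
    crowd = {"cluster-1": majority_label(["poetry", "poetry", "fiction"]),
             "cluster-2": majority_label(["news", "news", "advice"])}
    print(agreement(auto, crowd))  # 1.0 for this toy data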

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

Mining Miscellany builds on three years of work in the Viral Texts project, in which we have developed computational methods for automatically identifying reprinted texts, unknown a priori, from large-scale archives of nineteenth-century periodicals. This work has been funded by the National Endowment for the Humanities and the Mellon Foundation in the United States. Links to much of this work, and our team’s publications, can be found at http://viraltexts.org. While we have focused on the newspapers in the Library of Congress’s Chronicling America collection, our findings there have pointed to the importance of transatlantic circulation in the period, as many of the most widely circulated texts we encounter originated (or are claimed to have originated) in the UK. Our skills will support the computational text analysis portions of the project, and we are eager to work with the BL Labs team to develop a compelling public interface for our data.

Ryan Cordell has been researching reprinting and nineteenth-century periodicals for the past six years, and has published on the subject in a range of venues, including Digital Humanities Quarterly and American Literary History. In 2013 he was awarded the best article prize by the Research Society for American Periodicals for his work in this area. For an overview of his work, see http://ryancordell.org

Over the past decade, David Smith has done significant work in the areas of natural language processing and computational linguistics, with a focus on text reuse, large-scale text analysis, and digital libraries. He currently supervises the work of several computer science graduate students working in these same areas. Details about his work can be found at http://www.ccs.neu.edu/home/dasmith/

Abby Mullen, a PhD candidate in the history department at Northeastern University, is a fellow at the NULab for Texts, Maps, and Networks, where she has worked on the Viral Texts project for the past three years. She is proficient in Python and R, and has taught workshops on various digital methods. Over the past semester, she has worked on data cleaning and creation of a text-analysis tool for her own dissertation work, learning skills that can be applied to Mining Miscellany as well.

Jonathan Fitzgerald is a PhD student in the English department at Northeastern University focusing on literary journalism and digital humanities. In addition to being a Research Assistant for the Global Viral Texts project, he has worked on several digital projects using a variety of methods, including 3D printing, text analysis, and digital archiving. His most recent in-progress work is a live archive of an emerging genre on Instagram, often referred to as InstaEssays. That project can be viewed at www.instaessayarchive.org.

Both Abby Mullen and Jonathan Fitzgerald have been active throughout their graduate careers in Northeastern University’s NULab for Texts, Maps, and Networks. Both have a growing proficiency in the R programming language, which they will apply to classification and visualization tasks related to Mining Miscellany.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical
We know that the identification and extraction of reprinted texts is possible, given our previous work on the same task with American newspapers. The challenge for this project will be in the classification of genres. There are a number of ways we can go about this, including topic modelling and Naive Bayes classification in R. Several members of our team have done recent work on classification and modeling as well, so we have the skills and capacity required to make the project work. While classification requires some trial and error, we are confident that this will be a successful and compelling addition to the project.
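For the topic-modelling route, a comparable sketch (again in Python for illustration only; the documents below are invented stand-ins for cluster texts, and in practice we would likely work through MALLET and R):

    # Sketch of topic modelling over cluster texts; topics would be inspected
    # by hand and mapped onto genre labels. The documents are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "receipt for a plain plum cake take one pound of flour",
        "the queen's message to the president crossed the atlantic cable",
        "lines written on the death of a beloved child",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
    for topic in lda.components_:
        print([terms[i] for i in topic.argsort()[-5:]])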
The creation of the web interface will be the area in which the team has the least experience; however, we believe that this aspect of the project will be the one where the collaboration between our team and the BL team can be most fruitful.


Curatorial
What we need from a curatorial standpoint is access to the text files of as many 19th-century newspapers as we can get. Since the BL already has millions of digitized pages of newspapers, it should be fairly straightforward for us to index and analyze those digitized texts.

Legal
Nineteenth-century British and American newspapers should be in the public domain, so there should not be any legal problems, unless the BL holdings are under copyright through a third-party vendor. If that is the case, then we may need to negotiate with that third party to gain access to the texts we need.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between June 2015 - October 2015.

June 2015
Initial conversations with BL team about vision for the project (Mining Miscellany team)
Acquire texts and perform any transformations needed to fit our current processing pipeline (Mining Miscellany team)
Begin the detection process with the texts (Mining Miscellany team)

July 2015
After initial results, conversations with BL team about web interface and beginning creation of that interface (BL team + Mining Miscellany team)
Begin classification and modeling (Mining Miscellany team)

August 2015
Continue classification and modeling (Mining Miscellany team)
Prepare data for web interface (Mining Miscellany team)
Alpha version of web interface ready for testing by end of the month (BL team + Mining Miscellany team)

September 2015
Begin to write documentation for interface (BL team + Mining Miscellany team)
Begin to write up contextualization for results (what do these transatlantic texts show us, how do we read these results, etc.) (Mining Miscellany team)
Test alpha version of web interface (BL team + Mining Miscellany team)

October 2015
Release of web interface with all context written (BL team + Mining Miscellany team)
Record and address user feedback on interface (BL team + Mining Miscellany team)