M. H. Beals

Submitted Entry for 2015 Competition


In the wild west of the World Wide Web, if you compose a hilarious joke, provide a simple solution to a complex problem or break a major new story, it is almost certain that you will be plagiarised. Although intellectual property laws exist, they are inconsistently enforced because of the sheer number of sites on which infringement occurs – a number that increases with each passing second. If you are lucky, and your re-poster is honest, you may discover how far your ideas have spread through a pingback, an automatically generated comment on your original text with a link to its reprint.

In the nineteenth century, reprinting – especially unauthorised reprinting – was the backbone of Atlantic journalism, but these authors had no effective means of discovering the fate of their quips or queries, except through chance encounters with competing papers. Yet, whether prompted by the honesty of the editor or the desire to establish authenticity, a significant minority of articles contained an attribution. Appearing as either the initial dateline or a tag at its conclusion, these range from the very specific to the frustratingly vague. With the aid of these unobtrusive breadcrumbs, text-mining scripts, and a helping hand from the crowd, historians can provide what Georgian authors could only dream of – an in-depth understanding of just who was stealing from whom.

Georgian Pingbacks will work with BL Labs to map the 19th Century Newspapers database, providing a snapshot of the interconnectedness of the corpus and suggesting which hubs, or heavily referenced titles, exist outside it. The project will mine the XML of the newspaper collection for phrases such as ‘–From the’ as well as a list of known title components (Courier, Advertiser, Herald). Because the OCR text has too many errors to identify publications automatically, human intervention is needed. Working with the BL Labs team, I hope to construct an online space, using existing interfaces, where volunteers can contribute their human reasoning to the problem of identification. Provided with the top and bottom inch of an article's text, visitors can tag the attribution by title, date and/or location. After multiple verifications, I will codify the titles, reconciling duplicates and variants, and create a social network diagram of the references within the corpus.

Such a diagram will help historians understand the works they turn to so often and will surely be a thing of beauty.

URL for project:

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

Scissors-and-paste journalism, the process by which one periodical copied, in part or in whole, the text of another, was the backbone of Atlantic publishing in the early nineteenth century. Almost every publication was, in some way, part of this decentralised and largely unregulated network of information.

Although this is roughly understood by those who study periodicals directly, its nature and extent remain abstract for most humanities researchers and may be wholly hidden from the members of the general public using the Library's new Newspaper Reading Room. Thus, mistakes are often made in the search for public sentiment or in the identification of named individuals when the reader is ignorant of the text's true geographic (or chronological) origin. Too often a researcher's hopes are dashed upon realising (usually at the end of an article) that a potentially game-changing piece of evidence is irrelevant, having been snipped from a rival publication. Too often, too, does this experience raise the spectre that many more articles have been clipped without attribution, calling his or her wider conclusions into doubt. Georgian Pingbacks will help all these groups by providing an overview of the nature of the early-nineteenth-century press, highlighting its shape, extent and possibilities. Moreover, it will highlight, alongside 2013's Sample Generator, where voids exist within our digital reconstruction of this sprawling nineteenth-century network.

The project will also challenge our current understanding of the ideas of attribution, fair use and the public domain, as we develop new ways of identifying the connections between original and reprint. To what extent does acknowledgement seem to be gentlemanly conduct rather than legal requirement? How did editors signal a reprint, and for which types of texts was an attribution considered necessary or appropriate? The rapid rise of a new reprint culture on the World Wide Web has reignited questions of legitimate and unauthorised reuse, and understanding the subtle and unspoken rules of the Georgian public domain will help us better understand our own.

Finally, the project will continue our exploration of the XML underpinning the Newspaper Database: to what extent can we identify specific text tokens from the OCR collection, and how can we best integrate simple crowd-sourcing mechanics to verify and codify messy data?

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

Georgian Pingbacks will focus its attention on the early years (1800-1837) of the 19th Century Newspapers online database, using the 66 digitised Georgian titles (58 English, 2 Irish, 4 Scottish and 2 Welsh) within the collection. Despite the Georgians Revealed exhibition at the British Library, and the 18th Century Season on the BBC, the Victorian press has retained its dominance in the public imagination, helped along by Jeremy Clay’s contributions to the BBC News Magazine and the addictive (if groan-inducing) humour of Bob Nicholson’s Victorian Meme Machine. Yet, it is in the earlier period that we find the most obvious parallels to the modern spread of information on the World Wide Web – the almost infinitely polar networks, decentralised and democratic. By tracing the links, back and forth, between late-Georgian periodicals, the project will highlight the robustness (and pitfalls) of the early years of the collection, raising their profile and increasing confidence in their use by historical and literary researchers.

This project will also provide future researchers with a strong methodological framework for working with this and other large corpora of OCR-transcribed XML data. This data, designed to underpin search functionality and visual display, can often seem unwieldy to researchers undertaking text-mining projects. Through the targeted extraction of information from the collection, and the use of that information to model the network underlying that collection, Georgian Pingbacks will showcase the British Library's 19th Century Newspapers database in both form and content.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

The creation of the Georgian Pingbacks network will be undertaken in four sequential, if overlapping, phases:

The first stage will involve the identification of articles within the 1800-1837 period that likely include a textual attribution. This will be done through an XSLT transformation of the relevant XML file bundles, using regular expressions to represent the likely attribution tags and title references. As even this limited time frame represents a very large collection of newspapers, the project will begin with the 20 northernmost periodicals (Scotland and Northern England), for which I have the greatest contextual understanding, and which represent a counterpoint to the dominance of London (namely the Times) in many periodical studies.
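The attribution patterns described above can be prototyped ahead of the full XSLT pass. Below is a minimal sketch in Python; the tag phrase and the short list of title components are illustrative stand-ins for the fuller lists the project would compile, and the function name is hypothetical.

```python
import re

# Sketch only: a handful of title components standing in for the full list.
TITLE_WORDS = r"(?:Courier|Advertiser|Herald|Chronicle|Mercury)"

# Matches attribution tags of the form '–From the Glasgow Courier',
# allowing hyphen, en dash or em dash before 'From the'.
ATTRIBUTION = re.compile(
    r"[-\u2013\u2014]\s*From\s+the\s+"
    r"(?P<title>[A-Z][\w'. ]*?" + TITLE_WORDS + r")"
)

def find_attributions(article_text):
    """Return candidate cited titles found in one article's OCR text."""
    return [m.group("title") for m in ATTRIBUTION.finditer(article_text)]
```

A pattern of this kind would only flag *candidate* attributions; as noted above, the OCR errors mean the final identification still falls to human volunteers.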

Once identified, the second stage will involve extracting images – namely the first and final inch of text – from the relevant articles on a title-by-title basis. These will then be inserted into a crowd-sourcing platform. Although sophisticated platforms such as Omeka allow for public transcription, the limited amount of data required for this project allows for a much simpler format. My vision for the platform is a single, stable URL, which will pull header and footer images from a central database and display a different article set with each refresh. Underneath the image, a simple series of input boxes – title, location and date – would allow users to input their interpretation of the relevant dateline or footer tag. This could then be submitted to a central database, to be codified at a later date. While seemingly random to the user, the site would ideally work on a title-by-title basis and cease displaying a particular image once it has been verified five times.
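The serving logic described in this paragraph – display each image until it has been verified five times, then retire it – might be prototyped as follows. The class and field names are hypothetical; a real deployment would back this with the central database described above rather than in-memory structures.

```python
import random
from collections import defaultdict

VERIFICATIONS_NEEDED = 5  # threshold taken from the description above

class SnippetQueue:
    """Hypothetical sketch of the crowd-sourcing loop: serve each
    header/footer image until it has received five tags, then stop
    showing it."""

    def __init__(self, snippet_ids):
        self.pending = set(snippet_ids)
        # snippet_id -> [(title, location, date), ...]
        self.tags = defaultdict(list)

    def next_snippet(self):
        """Pick, seemingly at random, a snippet still needing verification."""
        return random.choice(sorted(self.pending)) if self.pending else None

    def submit(self, snippet_id, title, location, date):
        """Record one volunteer's reading of a dateline or footer tag."""
        self.tags[snippet_id].append((title, location, date))
        if len(self.tags[snippet_id]) >= VERIFICATIONS_NEEDED:
            self.pending.discard(snippet_id)
```

Serving from a per-title pool, rather than the whole corpus, would preserve the title-by-title workflow while still appearing random to the visitor.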

The third stage, an extension of the second, will be the deployment of the crowd-sourcing site and a social media campaign to encourage frequent contributions to the project. If the integration of an existing log-in system is possible, the development of achievement badges, signifying the number of links attributed, may be a means of encouraging frequent engagement and friendly competition via social media.

The final stage, undertaken in parallel with the third, will be the codification of attributions. Using resources such as the Waterloo Directory of 19th Century Periodicals, truncated titles will be formalised, duplications rationalised and inconclusive hits removed. As each title is rationalised, a social network diagram, alongside relevant network matrices, will be published on the project website. As these individual network hubs interconnect, additional mega-network diagrams will be produced and published. If possible, a manifest of derived data, identifying, but not reproducing, individual article-attribution pairs will also be made available to promote continued work on establishing and understanding these networks.
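The reconciliation of truncated and variant titles might be semi-automated along these lines. This is a sketch only: a toy authority list stands in for the Waterloo Directory, and difflib's fuzzy matching stands in for the manual judgement the project would actually apply to ambiguous cases.

```python
import difflib

# Toy authority list; the project would draw canonical titles from the
# Waterloo Directory of 19th Century Periodicals.
AUTHORITY = ["Glasgow Courier", "Leeds Mercury", "Caledonian Mercury"]

def codify(raw_title, cutoff=0.8):
    """Map a crowd-submitted title string to a canonical title,
    or None if the match is inconclusive and the hit should be removed."""
    match = difflib.get_close_matches(raw_title, AUTHORITY, n=1, cutoff=cutoff)
    return match[0] if match else None
```

Pairs that survive codification could then be written out as an edge list (citing title, cited title) for visualisation in Gephi or similar network software.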

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

I have been researching British newspaper history for the past four years, and have made extensive use of newspaper materials – digitised, on microfilm and in their original form – since beginning postgraduate studies in 2004. My experiences and methodologies appear frequently on my research blog (www.mhbeals.com), averaging 1500 hits per month, and I codified this advice into a short guide to using newspapers in historical research for the Higher Education Academy in 2011. I thus have a deep understanding of the mechanics and quirks of periodical material from this period and long experience in working with various digitised newspaper archives, including those maintained by Readex, Proquest, the Library of Congress and GaleCengage.

My research now focuses specifically on the distribution of reprinted material throughout the Anglophone world, resulting in a number of forthcoming publications as well as an online database of colonial news content. The latter (www.scissorsandpaste.net) is a collection of articles that first appeared in North America or Australasia and were later reprinted in British periodicals. The information is stored in a TEI-compliant XML database, and has been made available (CC-BY) along with several sets of derived data via a GitHub repository (http://www.github.com/mhbeals/scissorsandpaste ). Having encoded my transcriptions into the TEI standard, and having created a significant number of XSL transformation documents to render these transcriptions in alternative formats (csv, md, html), I am confident I have the necessary technical skills to navigate the 19th Century Newspapers XML data and to work alongside the British Library Labs team to efficiently mine and transform the data for crowd-sourcing verification. I am likewise familiar with HTML and CSS standards, including responsive displays, and can therefore engage in technical discussions on the best means for developing a robust project site. I have no relevant experience in creating an API, which will likely be needed to power the crowd-sourcing platform, and will need the particular support of the BL Labs team in this area.

Select Bibliography
Digital History: An Introduction. London: Bloomsbury, Forthcoming 2016.

“The Role of the Sydney Gazette in Scottish Perceptions of Australia, 1803-1842” in Historical Networks in the Book Trade (Forthcoming, 2015).

“Historical Insights – Focus on Research: Newspapers” with Lisa Lavender. History Subject Centre (2011)

“Dumfries & Galloway Courier (1809-1939)” Dictionary of Nineteenth-Century Journalism (Proquest)

“‘Passengers Wishing to Embrace This Commodious Conveyance, Will Apply Immediately’: The Rise in Emigrant Passage Advertising in the Scottish Borders, 1800-1830” International Journal of Local and Regional Studies 4:1 (2008)

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technically, I believe this project to be fully achievable within the time frame of the competition. The XML transcriptions are of varying quality across the corpus, but this can be mitigated by concentrating on the cleanest data first, and proceeding to other titles as time allows. As a proof of concept, even a few titles across a single year or month would provide significant insight into the methodological approach and allow for the project to continue with relatively little support from British Library Labs after October 2015.

The crowd-sourcing platform requires only a very simple interface – importing images, either directly from the 19th Century Database via an API or through the random display of a pre-cropped snippet, and a series of text boxes for users to input attribution metadata. If direct access to the relevant images cannot be achieved, but the images can be imported by British Library staff on site, the pre-cropping of images can be achieved relatively easily through batch image editing software. If they cannot be accessed and cropped automatically or downloaded directly, I will rely on the technical advice of the British Library Labs team to help expedite the process of selecting and capturing the relevant image snippets. In a worst-case scenario, a single title over a single year can be manually ‘snipped’ after articles are identified via text mining.

The ability to promote the project, and attract a suitable number of volunteers, is a concern, but can be mitigated by building upon the success of other British Library crowd-sourcing projects.

The materials required for this project, namely the XML transcriptions that support the search functionality of the 19th Century Newspapers online database, are in the possession of the British Library; however, these and the images from which they were derived are currently copyrighted by GaleCengage and I will therefore need to enter negotiations with them to mine the data in the manner described above and to display the dateline and tag-line snippets on the crowd-sourcing platform.

At the conclusion of the project, I envision that a database of derived data – including the publication title, publication date, an article identifier (either the headline or numerical identifier), and the attribution information – will be made available CC-BY through the project website, the British Library Labs website and other relevant online repositories. Alongside this database, I also intend to release, under the same licenses, all Gephi-generated social network diagrams and data developed from the derived data set.

The historical materials used in the project were published prior to 1838 and are therefore in the public domain; however, the XML transcriptions and images of these items are currently copyrighted by GaleCengage. If access to snippets is not possible via negotiations, access to the crowd-sourcing platform would have to be restricted to subscribing institutions or, more likely, to on-site access only, which would be detrimental to the scale of the project, but could possibly be mitigated by encouraging visitors to St. Pancras to engage in the project while waiting for delivery of their newspapers.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between June 2015 - October 2015.

June 2015

a) Work with British Library Labs team to select the sets of XML data that are best suited for text-mining

b) Develop mining protocols (regex) and desired outputs (metadata)

c) Negotiate with GaleCengage regarding the use of snippet images

d) Begin text mining to identify candidate articles

July 2015

a) Continue text mining to identify candidate articles

b) Prepare crowd-sourcing platform

c) Create relevant social media platforms

d) Create project site (or redevelop a section of www.scissorsandpaste.net)

August 2015

a) Continue text mining to identify candidate articles

b) Launch crowd-sourcing platform including FAQ

c) Work with British Library Labs Team to promote project and platform

d) Begin crowd-sourced data codification after completion of first set

September 2015

a) Continue text mining to identify candidate articles

b) Continue crowd-sourced data codification

c) Begin social network diagram construction

d) Publish social network diagrams and derived data as completed

October 2015

a) Continue crowd-sourced data codification

b) Continue social network diagram construction

c) Continue publishing social network diagrams and derived data

d) Work with British Library Labs to promote project and competition

e) Compose and publish report on project methodology and findings