Martyn Harris
Research Assistant
Birkbeck, University of London

Abstract

The British Library Microsoft Corpus contains more than 68,000 digital representations of original sources recording the history, poetry, literature, and philosophy of the period 1789 to 1914. The documents cover a broad range of languages and scripts, including English, French, Italian, Hungarian, Russian, and Malagasy, and the OCR quality varies from document to document.

The project will focus on the development of a web application that visualises the shared-sequences contained in the Microsoft corpus: common phrases, philosophical and poetical quotations, jokes, and liturgical formulae. The application makes them available to users through a customisable view, which can be incorporated as a banner-style widget on a home page, or shown on the information screens in the British Library lobby. The goal is to make the documents more visible to the general public and researchers by exploring what they have in common with each other. The application will use the document comparison tool of the Samtla system (Search And Mining Tools with Linguistic Analysis) to identify documents that are similar in content, and then extract the most common shared-sequences between multiple documents.

The tool will provide a number of simple options for users to customise how the shared-sequences are visualised. A slideshow-style visualisation advances through each related quote according to similarity or time period, causing the text to evolve organically on the screen. A further view enables a researcher to export the shared-sequences as Key-Word-In-Context (KWIC) style output for inclusion in their research. Together, these views allow the tool to cater to different types of visitor to the British Library building and website.

Research Question / Problem

Identifying shared-sequences between documents can be a complicated task for a user to perform manually. In addition, due to the diachronic nature of language and the editorial process, the shared textual fragments may diverge considerably, making them difficult to identify. A good example is the various editions of the Bible in English, where the Wycliffe, Tyndale, and King James Bibles are semantically similar in content, yet diverge considerably in how the text has been transcribed, due in part to language change over time. The editorial process also results in common phrases and quotes being transplanted, in whole or in part, into the work of other authors and into reference texts. Furthermore, historical documents raise issues such as the state of preservation of the original and the printing technology available at the time, which lead to character recognition errors in the digitised version and make the task of identifying similarities between documents even more complex. Lastly, the Microsoft corpus contains documents in multiple languages and scripts, so the adopted method and implementation should be language-independent, in order to promote the content of the British Library to the multilingual community of visitors coming to the Library or browsing the institution's website.

To address these issues, the project adopts a language-independent, probabilistic approach that identifies semantically related documents by measuring the similarity between the n-gram probability distributions of the documents. Once a group of related documents has been identified, we extract the top-n longest shared-sequences and present them as quotations to the user. The tool uses a tunable parameter to control the degree of tolerance between the shared-sequences, so that they may diverge by anything from a few characters up to several words. This allows us to compensate for the language-specific and author-specific variations between the quotes.
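The extraction step can be illustrated with a minimal sketch. It is not the Samtla implementation: it assumes a simple seed-and-extend strategy at the character level, and the function name `shared_sequences` and the parameters `seed_len`, `max_mismatches`, and `top_n` are hypothetical. Here the tolerance is modelled only as a number of substituted characters allowed on each side of an exact seed; insertions and deletions, which a real tool would need in order to tolerate whole words, are not handled.

```python
from collections import defaultdict


def shared_sequences(a, b, seed_len=8, max_mismatches=2, top_n=5):
    """Return up to top_n longest approximately shared sequences of a and b.

    Each result is a pair (fragment_of_a, fragment_of_b) of equal length.
    Illustrative assumption: exact seeds of seed_len characters are extended
    in both directions, allowing max_mismatches substitutions per direction.
    """
    # Index every seed_len-character window of a by its start offset.
    seeds = defaultdict(list)
    for i in range(len(a) - seed_len + 1):
        seeds[a[i:i + seed_len]].append(i)

    covered = {}  # diagonal (offset in a minus offset in b) -> end already reported
    hits = []
    for j in range(len(b) - seed_len + 1):
        for i in seeds.get(b[j:j + seed_len], ()):
            diag = i - j
            if i + seed_len <= covered.get(diag, 0):
                continue  # this seed lies inside a sequence already extended
            lo, hi = i, i + seed_len  # current span, as offsets into a

            budget = max_mismatches  # extend to the right
            while hi < len(a) and hi - diag < len(b):
                if a[hi] != b[hi - diag]:
                    if budget == 0:
                        break
                    budget -= 1
                hi += 1

            budget = max_mismatches  # extend to the left
            while lo > 0 and lo - diag > 0:
                if a[lo - 1] != b[lo - 1 - diag]:
                    if budget == 0:
                        break
                    budget -= 1
                lo -= 1

            # Trim mismatched characters left dangling at either end.
            while a[hi - 1] != b[hi - 1 - diag]:
                hi -= 1
            while a[lo] != b[lo - diag]:
                lo += 1

            covered[diag] = hi
            hits.append((a[lo:hi], b[lo - diag:hi - diag]))

    hits.sort(key=lambda pair: len(pair[0]), reverse=True)
    return hits[:top_n]
```

Raising `max_mismatches` lets a quote survive spelling variation or OCR errors, at the cost of admitting more spurious matches; setting it to zero reduces the sketch to exact shared substrings.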

Showcasing BL Digital Collections

The project will use the Microsoft Corpus of 68,000 scanned books and associated metadata as the primary data source. In addition, users will be directed to additional sources of information in the form of the British Library catalogue records, and the scans of the original document stored online in PDF format.

Methods

The approach adopts methods from information theory and bioinformatics to identify related documents and the shared-sequences between them. Specifically, we use the Jensen-Shannon Divergence (JSD) measure to assign a similarity score to each pair of documents based on the probability distributions of the n-grams they contain. The n-gram distributions are generated through statistical language modelling and use a character-level representation of the documents, enabling the tool to be language and domain independent. The documents are then ranked by their JSD score, and the top-n related documents are extracted for each document contained in the corpus.
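As a rough illustration of the ranking step, the sketch below computes the Jensen-Shannon Divergence between character n-gram distributions. It makes simplifying assumptions: the distributions are plain relative frequencies rather than the output of a smoothed language model, and the function names (`ngram_distribution`, `jensen_shannon_divergence`, `top_related`) and the default n-gram length of 3 are hypothetical choices, not those of Samtla.

```python
import math
from collections import Counter


def ngram_distribution(text, n=3):
    """Relative frequency of each character n-gram in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logarithms: 0 for identical distributions,
    1 for distributions that share no n-grams."""
    divergence = 0.0
    for gram in p.keys() | q.keys():
        p_i, q_i = p.get(gram, 0.0), q.get(gram, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture of the two distributions
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return divergence


def top_related(doc_id, docs, n=3, top_n=5):
    """Rank the other documents by JSD to docs[doc_id], closest first.

    docs is assumed to map a document identifier to its full text.
    """
    dists = {key: ngram_distribution(text, n) for key, text in docs.items()}
    scores = [
        (jensen_shannon_divergence(dists[doc_id], dist), other)
        for other, dist in dists.items()
        if other != doc_id
    ]
    return sorted(scores)[:top_n]
```

Because a lower divergence means greater similarity, the results are sorted in ascending order. Working at the character level means no tokeniser or language-specific resource is needed, which is what makes the measure applicable across the languages and scripts in the corpus.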

A second process extracts the shared-sequences from the text using an approach similar to the Basic Local Alignment Search Tool (BLAST) algorithm. Once extracted, the quotes are stored and passed to a web application that visualises the sequences in a number of different ways that cater to the different information needs of visitors to the British Library catalogues and website.

These include the following:

1.) Minimal:

This visualisation displays a randomly selected quote in the main area of the display to centralise the content. This view is most suitable for incorporating into a webpage as a banner-style widget, or for a kiosk-style screen, in order to engage visitors' interest in the resources held at the British Library. If the application is hosted on a webpage, then clicking on the quote displays the next permutation of the sequence for each of the top-n related documents. A further option will enable a slideshow-style progression through the quotes. By default the user will see the quotes sorted by their JSD score, meaning that the quotes will start to diverge as the rank position increases, allowing users to see how the quote has evolved as a result of the editorial process and of language change due to dialect and time. A further option would enable users to view the quotes in chronological order using the timestamps stored in the metadata record for each document. In this way a user

2.) Browse:

This view provides users with a filtering tool based on the metadata record of the document associated with each shared-sequence, which will enable them to locate a specific document and view the related shared-sequences. This would be most suitable for a user who is more invested and wishes to explore the collection in more detail.

3.) Export:

A third option enables users to export the shared-sequences between the documents in a Key-Word-In-Context (KWIC) style format, allowing them to save the results for their research or publications.
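A KWIC-style export could look something like the following sketch. The function name `kwic_lines`, the `width` parameter, and the square-bracket layout are assumptions made for illustration; the actual export format has not been fixed.

```python
def kwic_lines(text, phrase, width=30):
    """Yield one KWIC line for every occurrence of phrase in text.

    Each line shows up to width characters of left and right context,
    padded so that the phrase lines up in the same column on every line.
    """
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        yield f"{left} [{phrase}] {right}"
        start = text.find(phrase, start + 1)


if __name__ == "__main__":
    sample = "In the beginning God created the heaven and the earth."
    for line in kwic_lines(sample, "the", width=12):
        print(line)
```

In practice the text would first have its line breaks normalised to spaces so that each occurrence stays on a single output line.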

Evidence that Entrant(s) can successfully complete the project

My colleagues Prof Mark Levene and Dr Dell Zhang and I have been involved in this kind of research for more than four years. The research outcome of my PhD is Samtla (Search And Mining Tools with Linguistic Analysis), a system designed in collaboration with historians, linguists, and archivists that provides tolerant search and text mining tools for historic text collections.

The document comparison tool in Samtla will be used to extract the shared-sequences. The language-independent design of the Samtla system is demonstrated by the corpora on which it currently operates, which also suggest that the project could be applicable to other digital collections stored by the Library:

1.) A corpus of Aramaic Magic Bowls and Amulets from Late Antiquity (6th to 8th century CE) written in a number of related dialects including Aramaic, Mandaic, and Syriac. The texts are written in ink on clay bowls, and cover a wide range of subject matter. The research involves searching and comparing textual fragments, which are formulaic in nature and provide an insight into the development of liturgical forms that differ due to transmission over centuries and to orthographic variation resulting from differences in authorship or dialect. There are also transcription errors resulting from damage to the original artefact, or from illegible characters. Existing tools were not sufficient for identifying approximately matching text fragments, meaning the analysis was largely a manual process of comparison and documentation.

2.) The Vasari Research Centre are researching documents representing chapters from the book "Lives of the Most Excellent Painters, Sculptors, and Architects" by Giorgio Vasari (1511-1574), who is considered to be the founding father of the discipline of art history. The Vasari Samtla provides search and comparison over documents in the original Italian and the corresponding English translation, and is used for research.

3.) More recently we applied Samtla to a corpus for a pilot study organised by the British Library and the Financial Times (FT). The corpus contains editions of the FT newspaper from the years 1888, 1939, 1966, and 1991, represented by both the OCR data and the original scanned pages. This particular archive required new tools that could leverage the image data in order to compensate for poor quality OCR, which reflected the state of the art at the time of digitisation. Much of the text of the earlier articles (e.g. 1888, 1939, and to some degree 1966) is not reliably searchable due to the poor recognition rate. Consequently, the focus was on developing a metadata search component to complement the existing search tool, allowing users to search both the metadata and the full document.

4.) Samtla has been applied to the King James Bible, in English, for demonstration and evaluation purposes as many people are familiar with the content of the Bible.

5.) We also applied Samtla to the Microsoft corpus, for which we developed a search filter tool based on the metadata.

We therefore have experience with a range of corpora in different languages and time periods. We are also familiar with the complexities associated with natural language corpora, including those represented by historic text collections such as the Microsoft corpus.

How idea is achievable on a Technical, Curatorial and Legal basis

Technical:

The most challenging aspect of the project is identifying the related documents for each of the 68,000 documents. The comparison may need to be distributed across multiple processes on several servers in order to obtain the results in a timely fashion. I have access to a range of hardware at the university that can be used to process the data and host the web application.

Curatorial:

It would be beneficial to obtain feedback from the curatorial team as part of an iterative design process for the visualisation aspect of the tool, ensuring that the output is appropriate for the chosen platform (web or kiosk). Furthermore, there may be related information that can be linked to the tool so that users can be directed to the original documents and catalogue entries.

Legal:

The collection is copyright free as far as we are aware.

Plan

June 2016

Activity described here (e.g. what, when and by who)

Become more familiar with the document collection and gather sources of metadata and information about the documents including scans of the original that can be linked to each quote. This phase will mainly involve assessing the complexity of the task and addressing any potential scalability issues.

July 2016

Identify the related documents.

Extract the shared sequences.

August 2016

Filter the extracted sequences, removing any inappropriate content where possible.

Begin to design and prototype the user interface including the separate visualisations for each visitor type.

September 2016

Complete the implementation phase of the application.

October 2016

Work with the BL Labs team to host the web application on a part of the British Library's website, and explore other options including the kiosk information displays in the main lobby or around the library.