King's College London
PhD Researcher

Abstract

Improvements in computer technology combined with the existence of digital archives are enabling humankind to access an ever-greater amount of cultural knowledge. Nevertheless, searching large and heterogeneous corpora of digitised textual data in an effective and focused way has become increasingly complex. This project seeks to help the British Library display and query its digitised 19th Century Book Collection.
The ambition is to generate relevant tag information from the OCRd text of the 68,000 digitised volumes and to associate the selected words with the images of scanned pages. Once created, the tags are embedded in the metadata of the original photographic renditions of book pages, which are then uploaded to an indexed collection on the British Library's Flickr account.
In addition to offering an optimised presentation of the books in a digital environment, the project will use advanced tools of textual analysis to determine which metadata are appropriate for this particular database.
The project will build a thesaurus of names, places and organisations. Named entity extraction (using capitalisation patterns) will be used to build a lexicon of countries, cities, organisation names and given names cited in book pages. Once extracted, the geographical keywords can be geo-referenced to their coordinates and automatically mapped using Flickr's digital mapping service. In addition to creating this gazetteer, a British Library blog page can be set up to enable individuals and researchers to browse the list of people named in the collection and locate these mentions in the database.
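
Purely as an illustration of the geo-referencing step, a minimal Python sketch might use the geopy package and its Nominatim geocoder (an assumption on my part; Flickr's own mapping service is the tool proposed above, and the place names shown are placeholders):

    # Sketch: geo-reference extracted place names to coordinates.
    # Assumes the geopy package; Nominatim is one freely available geocoder.
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="bl-19c-books")  # hypothetical identifier

    def georeference(place_names):
        """Map each place name to (latitude, longitude), skipping misses."""
        coordinates = {}
        for name in place_names:
            location = geolocator.geocode(name)  # one request per name; mind rate limits
            if location is not None:
                coordinates[name] = (location.latitude, location.longitude)
        return coordinates

    print(georeference(["London", "Edinburgh", "Calcutta"]))
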
Computer-assisted topic modelling will be used as a statistical technique to categorise book pages. This automated process extracts from the documents a large set of word clusters that frequently occur together and can be read as topics. Depending on needs, 500 to 5,000 semantic themes will be identified in the collection by the statistical algorithm, each containing a set of ~30 keywords. For example, the topic "politics" might contain words such as Parliament, party, election, campaign, promise, etc. The keywords and the label of each category will be used as tags and will enable users to search for concrete (e.g. objects) and abstract (e.g. emotions) categories.
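
To make the shape of this output concrete, here is a minimal topic-modelling sketch. It uses the gensim library purely for illustration (the project itself will rely on Mallet, described in the methodology section), and the toy pages are invented:

    # Sketch: extract topics as clusters of co-occurring words (assumes gensim).
    from gensim import corpora, models

    # Toy stand-ins for OCRd book pages, already tokenised and lower-cased.
    pages = [
        ["parliament", "party", "election", "campaign"],
        ["election", "campaign", "promise", "party"],
        ["steam", "engine", "railway", "locomotive"],
    ]

    dictionary = corpora.Dictionary(pages)
    corpus = [dictionary.doc2bow(page) for page in pages]

    # The proposal targets 500-5,000 topics with ~30 keywords each;
    # two topics and four keywords suffice for this toy corpus.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id in range(2):
        print(topic_id, lda.show_topic(topic_id, topn=4))
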
Applying computational methods to this corpus will also enable me to trace the evolution of the English language in Britain across the decades covered by this collection. I have a particular interest in going beyond simple data management to produce academic scholarship that grasps linguistic trends quantitatively.

URL
https://goo.gl/Fy1jn4 (ID: blparcha, Password: London2016) for an overview of my digital archive (cf. the skills section), showing how I have used and displayed OCRd material before.

The research question / problem you are trying to answer

How can relevant information be harvested from an eclectic textual dataset and its content displayed online at minimal cost?


The project is based on a simple idea: images of book pages and selected searchable data contained within them (topics, spatial and person metadata) can be combined to create a substantial and user-friendly digital library on an almost free platform. Using the image-hosting website Flickr enables the British Library to give access to its out-of-copyright books to a large audience without having to bear the high server-based costs of an in-house digital preservation platform.
With millions of active members, the Yahoo-owned platform can be expected to remain long-lasting and resistant to technological obsolescence. The content can be browsed easily thanks to the inbuilt Flickr search engine, which allows for custom searches and the use of Boolean operators (AND, OR, NOT). To my knowledge, no other online platform enables institutional users such as the British Library to display such a large quantity of searchable book pages. However, a possible alternative in the long run would be to approach the Internet Archive electronic library, which displays scanned books from libraries such as the Boston Public Library (50,762 items) or the National Library of Scotland (4,859 items).

Please explain the ways your idea will showcase British Library digital collections

The British Library holds the world's foremost collection of material for the study of Britain and Ireland in the long nineteenth century (1789-1914). It can be argued that the 65,000 books digitised between 2006 and 2008 represent a fair share of the Library's entire collection for this period (~6.5%). However, this abundant content is not accessible to the general public, and the creation of a widely accessible online British Library book collection will showcase this neglected section of the humanities held by the BL. The potential reach is remarkable and the project can be developed further. As Flickr allows users to tag content on its website, crowd-generated searchable data can be produced over time to enrich the metadata catalogue of the British Library's 19th Century Books Collection. Such a database could attract millions of viewers from around the globe, who would become acquainted with BL holdings (for comparison, the British Library image collection on Flickr has over 330 million views).

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.
The completion of the project requires following a five-step data processing pipeline. Each process must run independently of the others so that unforeseen technical issues – caused by coding errors or by limitations in software or working memory – can be quarantined and fixed quickly.

Step 1. The dataset has to be prepared to make it suitable for computerised text manipulation. I need to process the OCRd book pages to extract the text from the PDFs and save each page in UTF-8 text format and in TIFF or JPG photo format. Both files need to share the same name so they can be merged at a later stage. Batch sequences can be run using JavaScript in Adobe Acrobat Pro DC. Depending on the quality of the OCR, I use Python's regular expressions (regex) module to extract a "cleaner" version of the 19th Century Books text files.
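
As an illustration of this cleaning step, a minimal regex sketch might look as follows (the file name is hypothetical and the patterns are conservative examples, not the final rules):

    # Sketch: clean OCRd text with Python's re module.
    import re

    def clean_ocr(text):
        # Rejoin words hyphenated across line breaks: "Parlia-\nment" -> "Parliament".
        text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
        # Replace characters outside the expected range (OCR noise such as '¬').
        text = re.sub(r"[^\w\s.,;:!?'\"()-]", " ", text)
        # Collapse runs of whitespace left over from the page layout.
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    with open("page_0001.txt", encoding="utf-8") as f:  # hypothetical file
        print(clean_ocr(f.read()))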

Step 2. From the text files I have to extract the tags that will be displayed on Flickr, filtering out all information that is not relevant. The Stanford Named Entity Recognizer (NER) is used to single out and extract names of persons, organisations and locations. Additionally, the Java-based Mallet package is essential to generate relevant topics and the bags of relevant words composing those topics. A word list with these keywords is created and saved in a CSV file. A Python script is then used to delete from the book-page files all words that have no corresponding entry in the CSV file – a copy of the dataset will be made prior to this operation. The remaining words in each text file are the tags associated with the images of book pages.
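
A minimal sketch of the filtering script might look as follows (file names and the CSV layout are assumptions; the real keyword list comes from the NER and Mallet outputs described above):

    # Sketch: keep only words with an entry in the keyword CSV;
    # the survivors become the Flickr tags for that page.
    import csv

    with open("keywords.csv", newline="", encoding="utf-8") as f:
        # Assumes one keyword per row, in the first column.
        keywords = {row[0].lower() for row in csv.reader(f) if row}

    def tags_for_page(path):
        with open(path, encoding="utf-8") as f:
            words = f.read().lower().split()
        # Deduplicate while preserving first-seen order.
        return list(dict.fromkeys(w for w in words if w in keywords))

    print(tags_for_page("page_0001.txt"))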

Step 3. Using a Python script, text files are batch encoded to fit a metadata template following RDF data model specifications. Then a unique identifier is generated to unambiguously identify text and image items within the repository. Using a Python script and the ExifTool command-line application, the words in the text files are embedded in the EXIF metadata of the book-page images. At this stage an additional copy of the dataset has to be made, as the operation overrides previous metadata contained in the images. Each image now incorporates EXIF metadata containing the relevant tags (placenames, organisation names, given names, topics and their related lexicon).
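
A minimal sketch of the embedding step, shelling out to ExifTool from Python, might look as follows (file names are hypothetical; note that ExifTool stores keyword tags in the IPTC/XMP sections of the image metadata rather than in EXIF proper):

    # Sketch: embed tags in an image's metadata via the ExifTool command line.
    import subprocess

    def embed_tags(image_path, tags):
        # "-Keywords+=" appends each tag; by default ExifTool keeps a
        # "_original" backup file, in line with the copy made at this stage.
        args = ["exiftool"] + ["-Keywords+=" + t for t in tags] + [image_path]
        subprocess.run(args, check=True)

    embed_tags("page_0001.jpg", ["London", "Parliament", "politics"])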

Step 4. The tagged book-page images are placed in subfolders within a single directory stored on a local machine with high storage capacity. Each subfolder contains all the pages of a single book. The directory is then uploaded using Flickr's "Uploadr" tool, which requires a pro subscription (£35 annual plan). The website interprets every folder as a photo "album" that can be browsed separately online.
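
Uploadr is the tool planned for this step; should a scripted alternative prove necessary, a minimal sketch with the flickrapi Python package (an assumption on my part, requiring valid API credentials) would look as follows:

    # Sketch: scripted upload as a fallback to Flickr's Uploadr tool.
    import flickrapi

    API_KEY, API_SECRET = "key", "secret"  # placeholder credentials
    flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET)
    flickr.authenticate_via_browser(perms="write")

    # Tags are passed as a single space-separated string.
    flickr.upload(
        filename="book_0001/page_0001.jpg",  # hypothetical path
        title="Page 1",
        tags="London Parliament politics",
        is_public=1,
    )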

Step 5. I will use the text files of the collection to build a lemmatised corpus suitable for textual analysis. Textometric tools such as TXM allow for a periodisation of writings in the collection based on the lemmas used and their frequencies. To explore this issue, correspondence analysis (CA) will be used to identify years in which books displayed analogous vocabulary. I will also trace the evolution of word occurrences and use TXM to calculate specificity scores across sub-corpora (organised by year). This method of analysis is in some respects similar to the one used by Google's "Ngram" viewer, but the outputs obtained will be far richer and more fine-grained. The results will be published in a leading journal in the field of Digital Humanities such as DSH (Digital Scholarship in the Humanities) or "Mots".
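
TXM will carry the actual analysis; purely to illustrate the simplest Ngram-like measure, a sketch computing the yearly relative frequency of a word over year-organised sub-corpora (the directory layout is hypothetical) might read:

    # Sketch: yearly relative frequency of a word, corpus/<year>/<page>.txt.
    from collections import Counter
    from pathlib import Path

    def yearly_frequency(corpus_dir, word):
        frequencies = {}
        for year_dir in sorted(Path(corpus_dir).iterdir()):
            counts = Counter()
            for page in year_dir.glob("*.txt"):
                counts.update(page.read_text(encoding="utf-8").lower().split())
            total = sum(counts.values())
            if total:
                frequencies[year_dir.name] = counts[word] / total
        return frequencies

    print(yearly_frequency("corpus", "railway"))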

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)
As part of my doctoral project, I have since 2013 been collecting and digitising a large corpus (around 70,000 documents) of pamphlets, posters, leaflets, manifestos, reports, letters and press releases produced by student organisations in two universities of New Delhi over four decades (1975-2015). On the Flickr platform I created the "Pamphlet Repository for Changing Activism" (PaRChA). The use of optical character recognition (OCR) permits the extraction of words from the printed material, making this huge body of texts suitable for custom searches and statistical analysis through various computer-assisted tools. It enables researchers to gain systematic qualitative and quantitative knowledge of micro- and medium-scale countercultural student movements, the claims they make and the evolution of the pamphleteer-language repertoire through time. The project can be accessed here: https://goo.gl/Fy1jn4 (ID: blparcha, Password: London2016). To achieve this I learned the Python programming language and deepened my understanding of metadata management. Moreover, I recently presented preliminary results of the textual analysis of the PaRChA corpus. The paper will be published in the proceedings of the International Conference on Statistical Analysis of Textual Data in June.


FIGURE 1: OCRd 1975 pamphlet and its metadata in the PaRChA archive

FIGURE 2: Results of the search “British + Library” in the PaRChA archive


FIGURE 3: Different sub-collections of the PaRChA archive ordered per authoring political organisation

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).
The main challenge posed by this project relates to the size of the digital collection: the estimated total data to be uploaded is 35,000 GB (35 TB). Fortunately, Flickr has proven to be a platform suitable for displaying very large datasets such as the Internet Archive Book Images (5,335,570 items) and the British Library's Flickr Commons images (1,023,705 items). I will, however, suggest that the British Library foster institutional ties with Flickr in order to benefit from its technical support. The automatic upload of the entire collection might be a lengthy process (up to two years); the core of this project is therefore to prepare all the needed metadata and insert them into the book images before putting them online.
Pre-upload data management of 25,000,000 documents is also a potential issue. To overcome problems related to manipulating such a quantity of items, each step of the data treatment (cf. the methodology section) has to be dealt with separately. Furthermore, preliminary work has to focus on a smaller sample of the collection in order to test the workability of each stage of the data processing.
Regarding the legal aspects, Flickr offers the possibility of displaying a licence next to the metadata of each book-page photo. As the content of the 19th Century Book Collection is out of copyright, the British Library can display it freely and decide whether it wants the content to be available for download (this option is available on Flickr). Furthermore, I had the chance to meet Aquiles Alencar-Brayner, Digital Curator in the Digital Research team, and we discussed the feasibility of the project in detail.

Please provide a brief plan of how you will implement your project idea by working with the Labs team
You will be given the opportunity to work on your winning project idea between 26 May and 4 November 2016

The main phase of development will happen from the beginning of July to the beginning of November.
Prior to this I will approach the Digital Research team and the Technical Lead in order to get first-hand access to the data of the 19th Century Book Collection. As the metadata, images and PDFs of the book pages are stored locally, I will have to assess the technical requirements of the project in situ. It is of prime importance for me to meet the curator in charge of the British Library image collection on Flickr and obtain from her the metadata template that was used for the upload of the material. I would also like to approach Katrina Navickas and the person in charge of the BL georeferencer in order to learn more about the language processing tools involved in extracting place names.
In July, I will create pilot Python scripts and test the outputs of Steps 1 to 4 (cf. the methodology section) on a small sample of the collection – e.g. 10,000 items. I will present my results to the BL Labs team and make the necessary revisions.
In August and September I will process the data of the entire database: organising the text and images and extracting suitable tags for the online database (Steps 1 & 2), then preparing and starting to upload the collection to Flickr (Steps 3 & 4). As previously mentioned, due to the size of the database, the automatic uploading process might take up to two years to migrate all the material online.
After the completion of the online section of the project, I will perform the text analysis of the entire corpus in October and subsequently write the research paper tracing linguistic changes in the collection across time (Step 5).