Mikko Tolonen, Niko Ilomäki, Leo Lahti

Submitted Entry for 2015 Competition

Abstract

The British Library (BL) - and the world - is full of digital and analogue document collections and their associated metadata. Librarians, researchers and the general public would benefit immensely if they could also analyse these collections statistically. Use your imagination and take a look at the summary that we have created for one BL data collection, based on the reproducible research environment that we propose in this competition entry:

https://github.com/rOpenGov/estc/blob/master/inst/examples/summary.md

Now extend that thought and consider that a similar analysis could be rapidly carried out with these tools, now and in the future, for any comparable collection maintained by the BL or other parties. This is what REBL has to offer.

This project provides comprehensive research algorithms to automate bibliographic data analysis from raw data to final visualization and online reporting. As a metadata project, REBL continues on the path previously ploughed in the BL Labs competition by the sample generator. Our toolkit can convert raw data into an analyzable format, carry out statistical analysis and visualization, and support the sharing of fully reproducible and interactive research reports online, following approaches similar to those that revolutionized computational biology in the last decade. A major bottleneck is that the original data collections, often plain text, are rarely amenable to statistical analysis as such. Moreover, the capabilities of ready-made tools are often limited to a particular application. Hence, there is a great need for flexible statistical research tools that automate the overall data analysis workflow.
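To make the workflow concrete, the sketch below illustrates the kind of conversion described above: free-text catalogue fields are harmonized into an analyzable table, from which a statistical summary follows directly. This is an illustrative Python sketch only; the actual toolkit is an R package, and the imprint strings, field names, and helper functions here are hypothetical examples, not real ESTC data or API.

```python
import re
from collections import Counter

# Hypothetical raw catalogue entries: free-text imprint fields as they
# might appear in a catalogue dump (illustrative, not real ESTC data).
raw_entries = [
    {"imprint": "London : printed for J. Dodsley, MDCCLXXVI. [1776]"},
    {"imprint": "Edinburgh, 1749."},
    {"imprint": "London, printed in the year 1701."},
    {"imprint": "Dublin : s.n., 1776."},
]

def extract_year(imprint):
    """Pick the last plausible four-digit year (1470-1800) from an imprint."""
    years = [int(y) for y in re.findall(r"\b(1[4-8]\d{2})\b", imprint)]
    return years[-1] if years else None

def extract_place(imprint):
    """Take the leading token before ':' or ',' as the publication place."""
    place = re.split(r"[:,]", imprint)[0].strip()
    return place or None

# Harmonize the raw entries into an analyzable table (a list of dicts
# here; a data frame in the real R toolkit).
table = [{"place": extract_place(e["imprint"]),
          "year": extract_year(e["imprint"])} for e in raw_entries]

# A simple statistical summary: documents per publication place.
print(Counter(row["place"] for row in table))
```

Once the data are in this tabular form, any standard statistical or visualization routine can be applied to them, which is the point of the harmonization step.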

We will showcase this with a statistical analysis of early modern book production based on the English Short-Title Catalogue (ESTC) data, 1470-1800. We have completed the pilot phase based on a history-focused subset of c. 50,000 documents provided to us by the BL. The relevance for the British Library is that this improves the quality of BL data collections for practical data analysis, as we convert all data fields of major research interest into harmonized data tables suited for statistical analyses (see our analysis of paper consumption across time in the github/estc repository). In the BL Labs competition, we will start by scaling up this analysis to cover all 466,000 documents in the ESTC to provide a comprehensive analysis of knowledge production in the early modern era. In the second stage, we will apply this framework to further data sets (e.g. the library catalogues of Sir Hans Sloane and/or the George III Collection: King's Library) to demonstrate the scalability of the tools to other BL collections.

URL for project:
https://github.com/rOpenGov/estc/blob/master/README.md

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

How can we take full advantage of library catalogues as a key data resource for quantitative research? The project represents a novel and powerful research paradigm as well as a practical implementation of algorithmic research tools for the analysis of large-scale historical document collections and associated metadata. This opens up novel perspectives on characterizing how social transformations are reflected in knowledge production and in different publishing genres.

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

We showcase the analysis of the British Library ESTC library catalogue by providing a rigorous quantitative perspective on early modern book production, based on the available information in the ESTC and supporting data sources from the BL and the public domain. We expect this to be of great help to the BL in encouraging research on the ESTC, because it provides automated tools for obtaining polished data sets that are readily amenable to statistical analysis, as well as new historical knowledge and a benchmark model that can be further expanded into a comprehensive cross-European model that also includes other library catalogues. We highlight many previously unknown quantitative aspects of the development of print media in the early modern era. The showcased tools can also be used to share the analysis results online in the form of an interactive application, where researchers can browse statistical summaries of BL data sets and potentially upload their own data sets to carry out similar analysis tasks.

Towards the end of the project, we will demonstrate the scalability of our approach on another British Library collection. In September, we can also organize a hackathon where the participants can test our toolkit on the ESTC data collection on their own laptops.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

REBL introduces for the first time an open source statistical ecosystem for the quantitative analysis of large-scale library catalogues, with a particular focus on the British Library ESTC collection. Our approach builds on four key elements: 1) research data (ESTC) and supporting data sources; 2) statistical open source libraries (R packages that we are already making available in their pilot phase at https://github.com/rOpenGov/estc ); 3) distributed version control (Github); and 4) reproducible document generation (Rmarkdown). Together these provide a way to seamlessly integrate the analysis from raw data through preprocessing, statistical analysis and summaries, and visualization to the final reporting. We have already obtained the bibliographic metadata for the ‘history’ subset of the ESTC (c. 50,000 documents), and by participating in this competition we look forward to scaling the work up to cover the full ESTC and other BL data collections.
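The fourth element, reproducible document generation, means that the same script that runs the analysis also emits the final summary document, so every number and figure can be regenerated from the raw data with a single command. The fragment below sketches that idea in Python for brevity; in the project itself this step is handled by Rmarkdown, and the decade counts shown are hypothetical placeholders, not real results.

```python
# Hypothetical analysis result: documents per decade (placeholder values).
documents_per_decade = {1700: 120, 1710: 150, 1720: 180}

# Emit the summary report directly from the analysis results, so the
# document can always be regenerated from the data.
lines = ["# ESTC summary (auto-generated)", "",
         "| Decade | Documents |", "|---|---|"]
for decade, n in sorted(documents_per_decade.items()):
    lines.append(f"| {decade}s | {n} |")

report = "\n".join(lines)
print(report)
```

Because the report is a deterministic function of the data, any reader can rerun the pipeline and verify, or extend, every claim in the published summary.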

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

MT (Mikko Tolonen) is a Professor of research on digital resources with a background in book and intellectual history. LL (Leo Lahti) holds a doctoral degree and has long-term experience in developing related computational ecosystems in bioinformatics and statistical machine learning. NI (Niko Ilomäki) is a talented undergraduate student completing a double degree in History and Mathematics. This team has a strong track record and the cross-disciplinary expertise to carry out the project. The project contributes to and draws further support from the rOpenGov statistical ecosystem (https://ropengov.github.io ), founded in 2010 by LL et al. Other rOpenGov tools, such as tools for visualizing geospatial data, will support our work on the BL collections.

Links to research profiles (including publications):
MT: https://tuhat.halvi.helsinki.fi/portal/en/persons/mikko-sakari-tolonen%281f6c4343-d64e-48d5-b6af-39d5f3442502%29.html
LL: http://www.iki.fi/Leo.Lahti

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical
All required information processing tools are readily available as open source and have already been proven to work in other fields of science, and the project team has the capabilities and a strong track record to carry out the proposed project. The team involves experts in book history (MT), computational science (LL), and a student of Mathematics and History (NI). The pilot phase has been successfully completed, and the plans for extending the analysis to the whole ESTC data collection are clear.

Curatorial
The curators Karen Limper-Herz and Iris O’Brien, responsible for curating the ESTC data collection, have confirmed that we can use the whole ESTC catalogue data for this project. Completing the work will involve polishing the R package, providing reproducible documentation, and carrying out the statistical analyses and reporting. These tasks are compact and well defined.

Legal
The research data (ESTC library catalogue) is available only for research purposes. We will keep the data confidential, while statistical summaries can be published. All code developed within this project will be released under an open source license (2-clause BSD).

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between June 2015 - October 2015.

June 2015 [Scaling up the analysis]
MT: Request access to the full ESTC library catalogue with metadata for 466,000 documents.

LL, NI: Scale up the analysis code already implemented with the 10% history subset (https://github.com/rOpenGov/estc/ ). Some fine-tuning of the present code will presumably be required to scale it up to the whole data collection, but this is expected to be a minor effort compared to the initial implementation, which is already complete.

July 2015 [Data pre-processing]
MT: Identifies data fields that are of particular interest for the proposed research (these include at least author information, publication year, place and title, document pages and dimensions, and related fields). MT will also help to validate the results and find supporting data.
LL: Preprocessing of the complete ESTC. Release the source code as an R package in Github.
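To illustrate the kind of field harmonization this preprocessing step involves, the sketch below parses a catalogue author field into a structured record with name and life years. This is a hedged Python illustration: the field format shown is a common cataloguing convention rather than the guaranteed ESTC layout, the function name is hypothetical, and the actual preprocessing is implemented in the R package.

```python
import re

def parse_author(field):
    """Split an author field like 'Hume, David, 1711-1776.' into a
    structured record (hypothetical format, for illustration only)."""
    m = re.match(r"^(?P<last>[^,]+),\s*(?P<first>[^,]+?)"
                 r"(?:,\s*(?P<born>\d{4})-(?P<died>\d{4}))?\.?$", field)
    if not m:
        return None  # leave unparseable fields for manual curation
    return {"name": f"{m.group('first')} {m.group('last')}",
            "born": int(m["born"]) if m["born"] else None,
            "died": int(m["died"]) if m["died"] else None}

print(parse_author("Hume, David, 1711-1776."))
print(parse_author("Defoe, Daniel."))  # life years absent: kept as None
```

Records that the parser cannot handle are returned as None rather than guessed at, so they can be flagged for the validation pass described above.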

August 2015 [Data Analysis]
LL: quantitative analyses and summaries; completing the reproducible documentation that allows fully automated and transparent analysis and makes every detail of the analysis, from the raw data to the final summary document, publicly available.
MT: hypothesis formulation and qualitative historical interpretation.

September 2015 [Documenting the tools]
LL, NI: Polish the final source code and provide clear and comprehensive documentation together with simple reproducible examples at https://github.com/rOpenGov/estc/blob/master/vignettes/tutorial.md
MT: curate the documentation and test the tools to ensure that they are usable also by researchers without a technical background.
LL, MT: in coordination with the BL, demonstrate the scalability of our approach by applying it to another BL collection (e.g. Hans Sloane’s Library and/or the King’s Collection)
LL, MT, NI: At this point, we can also organize a hackathon on our tools at the BL, together with the BL Labs Team, so that other people can work with the data and the toolkit.

October 2015 [Reporting the results]
MT, LL, NI: summarize and report the project; publicize the generated results and tools on social media
MT, LL, NI: Submit an article to a high-quality peer-reviewed journal.