Matt Prior

Submitted Entry for 2013 Competition

The Emergence and Distribution of Graphical Content within 19th Century Books

Abstract

As printing technologies advanced they gradually provided the ability to integrate graphical material within texts. The chronological, and potentially geospatial, evolution of that technology can be investigated by analysing the existence, size and distribution of images within texts.

The British Library's digitised 19th Century Books collection has metadata produced during the scanning process. This contains information about the text itself and separately lists the page numbers and page coverage of graphical elements. This project is intended to extract information from the metadata to form a high level dataset of graphical content within the texts.

The subsequent chronological and statistical analysis of this dataset would yield a series of visualisations providing insight into the evolution of this technology. Additionally, some limited geospatial data is available, so a crude analysis of geographical spread would be possible. Further decomposition and classification using alternative metadata tags might also prove insightful.

An additional component of the analysis could include a predictive system which would model the likelihood and distribution of graphical content within an unseen volume from the period covered by the collection.

Assessment Criteria

The research question / problem you are trying to answer*

Please focus on the clarity and quality of the research question / problem posed:

To catalogue the emergence and distribution of graphical content within 19th century books.

Please explain the ways your idea will showcase British Library digital collections*

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

The digitised 19th Century Book collection is a sizeable dataset, totalling some 30TB. With such large datasets, summary data is a key element in understanding the broader implications of the collection as a whole.

The aim of this project is to provide a high level view of the evolution of graphical content in 19th century texts.

It is hoped that this resource would provide a valuable tool in understanding the evolution and spread of print technology throughout the time period covered by the collection.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved*

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

The general approach is to mine the XML based metadata from the British Library 19th Century Book collection and extract the size and page locations of the graphical content within the volumes.
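
As a rough illustration of the extraction step described above, the following Python sketch reads one page of layout metadata and computes the fraction of the page covered by graphics. The XML layout here is an assumption: the element name `Illustration` and the attributes `PHYSICAL_IMG_NR`, `WIDTH` and `HEIGHT` are hypothetical stand-ins for whatever the collection's metadata actually uses, and the sample page is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical page metadata: element and attribute names are assumptions,
# not the confirmed schema of the collection.
SAMPLE_PAGE = """
<Page PHYSICAL_IMG_NR="12" WIDTH="2000" HEIGHT="3000">
  <TextBlock HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="1200"/>
  <Illustration HPOS="200" VPOS="1500" WIDTH="1000" HEIGHT="600"/>
  <Illustration HPOS="1300" VPOS="1500" WIDTH="500" HEIGHT="600"/>
</Page>
"""


def graphic_coverage(page_xml):
    """Return (page number, fraction of page area covered by illustrations).

    Overlapping illustration boxes are not merged, so the fraction is an
    upper bound when boxes overlap.
    """
    page = ET.fromstring(page_xml.strip())
    page_area = float(page.get("WIDTH")) * float(page.get("HEIGHT"))
    graphic_area = sum(
        float(el.get("WIDTH")) * float(el.get("HEIGHT"))
        for el in page.iter("Illustration")
    )
    return int(page.get("PHYSICAL_IMG_NR")), graphic_area / page_area


page_number, coverage = graphic_coverage(SAMPLE_PAGE)
print(page_number, round(coverage, 3))
```

Running this over every page of a volume would give the per-page records (page number, coverage) from which the high level dataset could be assembled.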

Once this data has been collected, the reasonably large sample size of 68,000 texts will make it amenable to a wide variety of time series and statistical analyses.

Tools including Mathematica, WEKA, Tableau and custom designed libraries will provide modelling techniques such as regression analysis, and classification methods such as random forests, neural networks and support vector machines to give insight into the data.
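
To make the regression step concrete, here is a minimal sketch of an ordinary least squares trend fit in plain Python, used in place of the tools named above purely for illustration. The yearly figures are made up for the example and are not results from the collection; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented illustrative values: share of pages carrying graphics, by year.
years = [1800, 1820, 1840, 1860, 1880]
illustrated_share = [0.02, 0.04, 0.06, 0.08, 0.10]

slope, intercept = fit_line(years, illustrated_share)
predicted_1900 = slope * 1900 + intercept
print(slope, intercept, predicted_1900)
```

A straight line is only the simplest candidate model; the real data may well call for a non-linear fit or one of the classification methods listed above.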

A range of advanced visualisations based upon chronology, geography, publisher, volume type and size can be derived from the final data and implemented in Mathematica or Tableau.

A prototype metadata extraction program was developed at the British Library Labs Hack event in collaboration with the curator.

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team*

E.g. work you may have done, publications, a list with dates and links (if you have them)

I hold degrees in Electronics (BSc) and in Machine Learning and Digital Signal Processing (MSc), and I am writing up my PhD in Machine Learning.

Previously I have worked at BP, IBM, AT&T, Racal and in Journalism.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis*

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical

The extraction of the location and position of images within the texts was demonstrated in prototype form at the Labs Hack event. Some estimation of scalability would be required to process the full metadata set. At somewhat over 1TB it is not unreasonably large, and because no in-memory processing is required to distil the metadata down to the final dataset, it is amenable to map/reduce approaches offered by Amazon Web Services, or even to a cluster of workstations with something reasonably simple such as RabbitMQ/Celery job scheduling.
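
The map/reduce shape of the problem can be sketched in a few lines of Python. This is a single-process illustration only, under the assumption that each volume has already been reduced to a small summary record; the record layout and the sample values are hypothetical. Because each volume is mapped independently, the map step is the part that could be farmed out to worker machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical per-volume summaries:
# (publication year, total pages, pages containing graphics)
VOLUMES = [
    (1812, 300, 3),
    (1812, 200, 1),
    (1875, 400, 40),
]


def map_volume(record):
    """Map step: turn one volume summary into keyed counts."""
    year, pages, graphic_pages = record
    return Counter({(year, "pages"): pages,
                    (year, "graphic_pages"): graphic_pages})


def reduce_counts(left, right):
    """Reduce step: merge two sets of keyed counts by addition."""
    return left + right


totals = reduce(reduce_counts, map(map_volume, VOLUMES), Counter())
share_1812 = totals[(1812, "graphic_pages")] / totals[(1812, "pages")]
print(totals, share_1812)
```

Since the reduce step is a simple associative addition, partial totals from separate workers can be combined in any order.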

Analysis of the final data is perfectly tractable in applications such as R, Mathematica or Tableau.

Curatorial

The metadata is reasonably self-contained and it is not expected that there will be significant demand on curatorial resources, though guidance and possible enhancements to the core idea will be welcomed.

Legal

The project focuses on the metadata relating to the texts, the large majority of which will be out of copyright. At this stage it is not anticipated that there will be significant legal issues relating to the work.

Please provide a brief plan of how you will implement your project idea by working with the Labs team*

You will be given the opportunity to work on your winning project idea between July 6th - October 31st 2013

July 6 - Onwards
Activity described here (e.g. what, when and by who)

Data access/Transfer
Computational Resource Estimation
Access to metadata for assessment of data cleanliness and degree of missing data
Graphic content data extraction from metadata

August 2013
Graphic content data extraction from metadata
Statistical analysis of data

September 2013
Time series analysis of data
Visualisation

October 2013
All activities to be performed by M.Prior.