Emanuil Tolev

Submitted Entry for 2013 Competition

British Library Digital Collections Inter-Collection Link Service and Europeana Integration

Abstract

The inter-collection link service shows how items in the British Library Digital Collections are linked together using cutting edge open source search and indexing technology. The project will specify and attempt to build the following tools to realise this idea and bring greater exposure to the British Library Digital Collections (in order of development):

1/ A bare-bones service which shows how items reference each other. Given the name of an item or an author (or any word), it returns a list of Digital Collections items whose metadata contains that name / author / other word.

2/ Integration with the Europeana EU cultural portal project. This would allow users of Europeana to see British Library items which refer to the Europeana object they are viewing. The British Library is already a partner in Europeana and items submitted by the British Library are hyperlinked to the Digital Collections - however, this idea would allow looking for links between any Europeana object and the items in the Digital Collections and hyperlinking them in an automated fashion.

3/ Visualise collection-level links. For example, how many times (and where) are Victorian Popular Music pieces referred to in the 19th Century Books collection?

4/ Items 1-3 work with the metadata of all British Library Digital Collections. The last phase of the project would attempt to build a lightweight framework for analysing the full text of Digital Collections text items. Providing real-time search capabilities on a whole collection has many implications, but in relation to this idea, it will be possible to find out which text items refer to a name or term of interest in their full text as well as the metadata, providing for much deeper links between the collections (and Europeana objects).

Each stage builds on the previous one. Thus, even if difficulties in the exploration-intensive first stage delay the work, the project would still bring exposure to the Collections as well as an enhanced researcher experience. This is because the modular approach to development would allow (for example) completing items 1 and 2, as well as laying solid foundations for item 3 and related innovation in the future.

All stages would provide immediate greater exposure and enhanced researcher experience. However, stage 4 is of particular interest since a lot of future development in the area of research support can be based on the idea of providing fast, yet relatively inexpensive search over available full text. The author has particular experience with an up-and-coming open source piece of software which would make this usually technically difficult endeavour achievable within the 4 month scope of this project.

Assessment Criteria

The research question / problem you are trying to answer:

Please focus on the clarity and quality of the research question / problem posed:

Investigating the internal and cross-collection relationships between items in the British Library Digital Collections and the Europeana collections

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

Europeana is a European digital cultural portal which enables access to millions of digital representations of cultural artefacts. The British Library itself is a very important partner in the project and has submitted a number of collections.

While it is currently possible to search Europeana and the British Library's digital collections metadata for a particular topic of interest, more could be done to actually link the collections together. Scholars will often want to know more about an object which they deem useful for their research, but deep exploration is difficult. For example, a scholar researching the life of 19th century entrepreneur William Alexander Madocks will quickly find the painting "The Embankment, Traeth Mawr, Tre-madoc, North Wales" on Europeana [1] by searching for the subject's name. The resulting page invites the reader to "view item at The British Library".

It is at this point where the scholar will have the opportunity to engage much more deeply with their research subject by exploring The British Library's digital collections. The inter-collection link service would allow Europeana to display a suggestions box titled "References to this item within The British Library's Digital Collections".

The suggestions presented would depend on whether the attempt to process the full text of the textual British Library collections is successful.

a) If so, then the suggestions shown will include links to all books and other text items which mention the item the researcher is looking at in their actual content.

b) If not, then the suggestions shown will include links to all text items whose metadata mentions the target item.
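
The metadata-only matching in option b) can be illustrated with a minimal sketch. It is not the proposed elasticsearch-based service: it uses a small in-memory inverted index, and the records, identifiers and field names ("title", "description") are hypothetical examples, not real British Library metadata.

```python
import re
from collections import defaultdict

# Hypothetical metadata records; ids and field names are assumptions.
RECORDS = [
    {"id": "bl-001", "collection": "19th Century Books",
     "title": "A Tour of North Wales",
     "description": "Includes an account of the Embankment at Traeth Mawr."},
    {"id": "bl-002", "collection": "Victorian Popular Music",
     "title": "Songs of the Welsh Coast",
     "description": "Sheet music with a cover engraving of Tre-madoc."},
    {"id": "bl-003", "collection": "19th Century Books",
     "title": "Railways of England",
     "description": "A survey of early railway companies."},
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record ids whose metadata contains it."""
    index = defaultdict(set)
    for record in records:
        for field in ("title", "description"):
            for token in tokenize(record.get(field, "")):
                index[token].add(record["id"])
    return index


def suggest(index, query):
    """Return ids of records whose metadata contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


INDEX = build_index(RECORDS)
print(suggest(INDEX, "Traeth Mawr"))
print(suggest(INDEX, "Tre-madoc"))
```

A suggestions box would call something like `suggest` with the title of the object being viewed, so the scholar does not have to run the secondary search by hand.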

Note that even if option b) comes about, the resulting links between British Library Digital Collections items and the integration with Europeana will present a significant enhancement to the researchers' experience. A scholar could, in theory:

1/ Search for the topic they are interested in

2/ Find a related item on Europeana or in the British Library Digital Collections

3/ Perform a secondary search using the item's title this time

However, a scholar only has time for a limited number of manual searches. Suggestions, on the other hand, present themselves immediately for evaluation and can be a direct source of inspiration. The data used in option b) would be the same as a manual search returns, but it would be presented as ready-made suggestions rather than as the results of a search the scholar has to run.

Of course, if option a) is discovered to be technically plausible and a full-text index is made available, then exposure of British Library Digital Collections content could be increased dramatically as it would be possible to say which texts refer to cultural objects in Europeana. (Or which texts refer to other items in the Digital Collections, with the possibility to map out a dense network of cross-collection connections using only the British Library data. However, Europeana might bring cross-collection links closer to the everyday workflow of researchers and other stakeholders.)

[1] Billington, H., "The Embankment, Traeth Mawr, Tre-madoc, North Wales", 1881, http://www.europeana.eu/portal/record/92037/6A53F57926C98F8821C1E3C722B7C2D7A4D9C9B1.html

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved*

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

Stage 1 (as described in the abstract) would include desk-based research (of available metadata) and text mining and indexing to create the basic service which underpins the other stages.

Stage 2 would include collaborating with the Europeana project and bespoke open source software development to allow Europeana to integrate the results returned by the Stage 1 service into their user interface.

Stage 3 would include visualisation and some analysis of the data now available through the Stage 1 service to determine interesting cases to visualise, as well as minor bespoke open source software development to create an index of aggregate data (an intermediate step when visualising the data, since the project would essentially be generating collection-level data from item-level data).

Stage 4 would include analysis of the textual data available and bespoke open source software development to allow effective indexing and search of the data, as well as integration of the newly available data with the previous stages of the project (through feeding into the Stage 1 service).
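
The aggregate step described for Stage 3, deriving collection-level data from item-level data, can be sketched as below. This is only an illustration under assumptions: the shape of a link record and every identifier in it are hypothetical, and the real aggregate index would live in the search software rather than in memory.

```python
from collections import Counter

# Hypothetical item-level links as the Stage 1 service might report them.
# Every identifier and field name here is illustrative, not real data.
ITEM_LINKS = [
    {"source": "book-17", "source_collection": "19th Century Books",
     "target": "song-3", "target_collection": "Victorian Popular Music"},
    {"source": "book-17", "source_collection": "19th Century Books",
     "target": "book-42", "target_collection": "19th Century Books"},
    {"source": "book-99", "source_collection": "19th Century Books",
     "target": "song-8", "target_collection": "Victorian Popular Music"},
]


def aggregate(item_links):
    """Count links for each (source collection, target collection) pair."""
    return Counter(
        (link["source_collection"], link["target_collection"])
        for link in item_links
    )


def crossing_boundaries(collection_counts):
    """Keep only the pairs whose links go from one collection to another."""
    return {
        pair: count
        for pair, count in collection_counts.items()
        if pair[0] != pair[1]
    }


COUNTS = aggregate(ITEM_LINKS)
for (source, target), count in sorted(COUNTS.items()):
    print(f"{source} -> {target}: {count}")
```

Counts of this kind are what a visualisation would consume, for example to show how often one collection refers to another.
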
Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team*

E.g. work you may have done, publications, a list with dates and links (if you have them)

The author has participated in a number of Open Knowledge and Open Higher Education projects in the past few years, all of them undertaken with the Cottage Labs partnership of software developers dedicated to Open Knowledge and Open Source. This team can be relied on for pieces of this work as well, even though the author will be the main developer working on the project.

The technologies used are stated next to each project title, in case this application is reviewed by technical developers. The "elasticsearch NoSQL storage" is the cutting edge search and indexing software referred to in the abstract.

1/ Jorum Paradata (Python, Ruby, Shell, elasticsearch NoSQL storage): a mid-size project with Mimas (a national UK data centre, with focus on education) focussed on Open Educational Resources. Cottage Labs developed a two-tier web application which included a dashboard visualising the usage data of resources across time and by country. The author worked on thoroughly deduplicating and enhancing the app's main dataset of 15,000 records, including merging, canonicalisation, validation and enriching the dataset by adding information from external data sources.

http://beta.jorum.ac.uk/

2/ OpenArticleGauge (Python, Flask, Celery, elasticsearch NoSQL storage, Shell): high-capacity web application which tries to find out the license of a given scholarly article from its identifier. Developed in partnership with Public Library of Science. The author worked on publisher-specific plug-ins which do the actual license guessing via page scraping and consuming publishers' APIs, where available.

http://oag.cottagelabs.com/

3/ FundFind (Python, Flask, elasticsearch): The author led the development of this web application, which supports sharing and powerful searching of current funding opportunity data and is aimed at junior and senior scholars.

http://fundfind.cottagelabs.com/

4/ IDFind: The author finished the development and took on maintenance of this web application. It crowdsources identifier information - e.g. "what does a Digital Object Identifier or an ISBN look like?". Users can then give it a string that they believe is an identifier, but are unsure of what it is. The app is aimed at the text mining / PDF processing / page scraping community within the Open Knowledge field, as well as the occasional scholar.

http://test.cottagelabs.com/idfind/

All our projects have an associated RESTful API which allows them to function as modular services whose data can be gathered automatically and reused. We design the API right into the architecture of all applicable web projects from the very start.

Two projects we have significantly contributed to may be of particular help in this case:

5/ Bibserver, an OKFN project ( https://github.com/okfn/bibserver ), is a tool which manages collections of data. It was initially intended to manage bibliographic data, but has proven (in the OpenArticleGauge project) to be extremely useful as an archive layer for applications that need to deal with a lot of data and still perform well. It relies on the elasticsearch NoSQL data storage server to achieve great response times, and our expertise with using elasticsearch can help this project store and analyse data regardless of whether Bibserver is used.

6/ Facetview, another OKFN project ( https://github.com/okfn/facetview ), is a JavaScript web user interface component which works with elasticsearch to allow the intuitive exploration of data via automated data analysis called "faceting" (analysing what values a particular field holds across a large dataset). The real-time search responses combined with the ability to filter on facets make drilling down on interesting data much easier. A good demonstration on a large dataset can be seen here: http://beta.jorum.ac.uk/find

This is particularly relevant to exploring currently available Digital Collections metadata and selecting collections for visualisation for Stage 3.
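
The idea of faceting can be shown with a small sketch, assuming nothing about how Facetview or elasticsearch implement it internally. The records and field names are hypothetical; the sketch only counts the values a field holds across a dataset and then filters on one of them.

```python
from collections import Counter

# Hypothetical records; titles, subjects and countries are made up.
RECORDS = [
    {"title": "Intro to Statistics",
     "subject": ["mathematics", "statistics"], "country": "UK"},
    {"title": "Organic Chemistry Notes",
     "subject": ["chemistry"], "country": "UK"},
    {"title": "Probability Primer",
     "subject": ["mathematics"], "country": "IE"},
    {"title": "Untitled upload"},  # records with missing fields are tolerated
]


def field_values(record, field):
    """Return the values of a field as a list, whether it holds one or many."""
    value = record.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet(records, field, top=10):
    """Count how often each value of the field occurs across the records."""
    counts = Counter()
    for record in records:
        counts.update(field_values(record, field))
    return counts.most_common(top)


def filter_by(records, field, wanted):
    """Drill down: keep only the records whose field contains the value."""
    return [r for r in records if wanted in field_values(r, field)]


print(facet(RECORDS, "subject"))
print([r["title"] for r in filter_by(RECORDS, "subject", "mathematics")])
```

Applied to Digital Collections metadata, facet counts of this kind would show at a glance which fields are populated and which values dominate a collection.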

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical

There are three main technical aspects to the work.
1/ Processing the metadata of British Library Digital Collections for the Stage 1 service.
The author has worked with metadata from a variety of fields (mainly Open Educational Resources, scholarly publications and scholarly funding) in a variety of formats (CSV, JSON, XML and particular dialects such as BibJSON and LOM).

The elasticsearch software which the author currently intends to use to store the Digital Collections metadata has been used by Cottage Labs in the past with great success. This includes a full-text search over the metadata of the whole MEDLINE dataset (20 million records about scholarly articles and other artefacts from medical scholarship), which returned results within seconds.

The OpenArticleGauge open source project mentioned above tries to enhance and create metadata from nothing more structured than HTML pages published by various academic publishers. It uses caching (with expiration times) and longer-term archiving to keep this metadata up to date without having to re-process all the source information every time a request is made. These techniques are directly applicable to keeping an index of the Digital Collections metadata for analysis and finding links between items.
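
Caching with expiration times can be sketched as follows. This is not the OpenArticleGauge code: the class name, the `compute` callback and the injectable clock are assumptions chosen to keep the example self-contained.

```python
import time


class ExpiringCache:
    """Cache computed values and recompute them once they are too old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}    # key -> (value, time at which the value expires)

    def get(self, key, compute):
        """Return the cached value for key, calling compute(key) if stale."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = compute(key)
        self._store[key] = (value, now + self.ttl)
        return value


def fetch_metadata(identifier):
    """Stand-in for an expensive step such as scraping a source page."""
    return {"id": identifier, "fetched": True}


cache = ExpiringCache(ttl_seconds=3600)
print(cache.get("item-1", fetch_metadata))  # computed on first request
print(cache.get("item-1", fetch_metadata))  # served from the cache
```

The same pattern lets a metadata index answer repeated requests quickly while still refreshing each record after its expiry time passes.
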
2/ Processing the full text of text Digital Collections
The same notes apply as for processing the metadata presented in 1.

However, the full text of a Collection can be quite large. Elasticsearch is designed to deal with large volumes of data (it handles storage of the logs of popular websites quite easily, and these can run into a few gigabytes) and is very specifically designed to deal with data with holes or unclear structure. However, the author has not attempted to process 25 million pages (as in the 19th Century Books Collection) before, and it may turn out that doing so is possible but carries higher computing resource costs. A successful attempt would be of great immediate and future benefit to research support tools based on the text Digital Collections.

At first, only one of the text Collections will be processed on a full text basis. The 19th Century Books Collection seems like a good candidate as it seems that references to cultural artefacts inside these books will be of interest to many scholars. However, this choice can naturally change after consulting with resident developers and scholars at the British Library, or at the Library's behest.

Curatorial

No new information will need to be collected or curated by the British Library Labs team(s) as part of the project.

Access to the metadata of all British Library Digital Collections which the British Library Labs team(s) would like to be processed will be needed.

Access to the full text of all text British Library Digital Collections will be needed for Stage 4 of the project. This may include local access if the size of the data prohibits effective processing, although the project software will be capable of piecemeal processing over a longer period of time if needed (as the underpinning technologies are).
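
Piecemeal processing can be sketched as batch-by-batch indexing that is able to resume from a known position. The function names and the `index_batch` callback are hypothetical; in practice the callback would send each batch to the search index.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from the iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_piecemeal(pages, index_batch, batch_size=1000, start_at=0):
    """Index pages in batches, skipping the first `start_at` pages.

    Returns the position reached, which can be stored and passed back as
    `start_at` to resume after an interruption.
    """
    done = start_at
    for batch in batches(islice(pages, start_at, None), batch_size):
        index_batch(batch)
        done += len(batch)
    return done


indexed = []
position = index_piecemeal(range(10), indexed.append, batch_size=4)
print(position, indexed)
```

Because only one batch is held in memory at a time, the size of a whole Collection matters less than the time available to work through it.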

Legal

The project will build generic software and tools usable on all Digital Collections. However, initial development will use Collections which are clearly out of copyright as advised by the British Library Labs team(s).

Europeana publishes all its metadata under the terms of the Creative Commons Zero Public Domain Dedication (CC0) (see http://pro.europeana.eu/web/guest/licensing ).

Of course, the software will also be usable on copyrighted materials within the British Library, should the Library's own development team(s) wish to use it that way, even in cases where the public does not have any rights to the materials.

The author would strongly prefer that all software outputs are open source under a permissive license such as the MIT license, the new BSD license or the Apache 2.0 license (or other), depending on the legal requirements of the British Library itself. The whole scholarly sector and the Open Knowledge sector in particular have greatly benefited from the permissive sharing and reuse that such licenses allow. The author is ready to present more information in support of this point if required.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between July 6th - October 31st 2013

NOTES: All tasks to be performed by the author unless otherwise stated. Expertise may be brought in from Cottage Labs LLP if required. If this is done, it will be done in a transparent and clear way, such that the British Library Labs team(s) do not suffer from increased communication overhead on this particular project. Of course, where a task requires data, it is dependent on the British Library providing the data, after which the task can proceed. Some of the tasks can be executed in parallel, or initially with a sample of the data (e.g. development can proceed on one collection and later expand to cover more as metadata becomes available).

July 6 - Onwards
1/ Desk-based research and analysis of the available metadata for Digital Collections
2/ Procuring a sufficiently powerful computer system with enough operating memory to hold an index of this metadata
3/ Indexing of the available metadata into elasticsearch
4/ Contacting Europeana and letting them know of the idea (preferably with an example of how references to a Europeana object can be found in the British Library metadata), as well as figuring out a convenient time for collaboration on Stage 2 of the project.
5/ Investigate basic properties of the full text of text Collections, such as availability (of the actual text, not OCR data) and size

August 2013
6/ Finishing and automating all indexing of available metadata (scope to be determined in collaboration with the British Library Labs team(s)).
7/ Providing documentation on including new metadata in the index. This completes Stage 1.
8/ Integrating the outputs of Stage 1 with Europeana, completing Stage 2.

Europeana might take a while to deploy the results since they need to fit them with the current site. If they do not wish to integrate the results from the search service (Stage 1) into the main Europeana site, a separate interface showing where Europeana objects are referenced in the British Library metadata will be provided.

9/ Setting up an aggregate index of the Stage 1 service data (how many links go over collection boundaries and similar measurements)
10/ Progressing the work on integrating full text as a data source for the Stage 1 service (in addition to the metadata). Attempt to index 1/20th of the available data to evaluate technical feasibility (and avoid large data transfer problems).

September 2013
11/ Selecting appropriate visualisation examples by analysing the aggregate index and main index in the Stage 1 service
12/ Choosing appropriate visualisation (author usually uses https://github.com/mbostock/d3/wiki/Gallery to choose due to previous experience with this well-known software library)
13/ Developing visualisation (usually mostly involves converting the data into a suitable representation / structure since the actual aggregate data will be available by this stage). This completes Stage 3 of the project.
14/ Setting up monitoring. Since there are two indices, one service and a visualisation running by this point, these need to be transitioned from development to production, a step which is often overlooked. This might take some time as decisions are made where the services should be hosted - for example, Cottage Labs have previously sourced computers and looked after services after the development was over, in cases where it would have been more expensive for the client organisation to do so. Hosting on sufficiently powerful British Library equipment is also fine, but time needs to be allotted for collaboration with the Library's ICT team(s).
15/ Progressing the work on integrating full text as a data source for the Stage 1 service (in addition to the metadata). Integrate the results of step 10 into the Stage 1 service, if successful.

If the full text work is successful, at this point it should be possible to see some references to British Library and/or Europeana items where the reference is only made in the text of a work held at the British Library.

Additionally, attempt to index a further 9/20ths of the available data if previous work on this line has been successful.
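
One possible way to split the full text into twentieths for the phased indexing is to assign each item a stable bucket derived from its identifier. This is an assumption for illustration only: the plan does not say how the data would be divided, and the identifier format and phase names below are hypothetical.

```python
import hashlib


def twentieth(item_id):
    """Return a stable bucket number from 0 to 19 for an item identifier."""
    digest = hashlib.sha1(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 20


def phase(item_id):
    """Assign an item to an indexing phase based on its bucket."""
    bucket = twentieth(item_id)
    if bucket == 0:
        return "pilot"      # the first 1/20th, used to test feasibility
    if bucket <= 9:
        return "second"     # a further 9/20ths
    return "remainder"      # everything not covered by the earlier phases


def select(item_ids, wanted_phase):
    """Return the identifiers that belong to the requested phase."""
    return [item_id for item_id in item_ids if phase(item_id) == wanted_phase]


IDS = [f"page-{n}" for n in range(200)]  # hypothetical identifiers
for name in ("pilot", "second", "remainder"):
    print(name, len(select(IDS, name)))
```

Hashing spreads items roughly evenly and gives the same answer on every run, so a later phase never re-indexes what an earlier phase already covered.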

October 2013
16/ Publicising the outputs of Stages 1-3. This needs to be done in collaboration with the British Library's own communications team(s).
17/ Attempting to index the remaining 10/20ths of the available text Digital Collections data.
18/ Integrate the results of the steps 15 and 17 into the Stage 1 service, thus making references in text Collections available.
19/ Publicise the work conducted at relevant events.