Explore the Stacks (Category: Research)

Name of Submitter(s): Mark Hall
Organisation: Edge Hill University / Department of Computing

Digitisation efforts such as the British Library's release of a million images have created massive digital collections and made our cultural heritage much more available to the public. However, making a digital collection available is not the same as making it accessible, because collections are typically made available only through a search system. This works well for the curator, who knows what can be found, and for the expert, who knows which keywords to use. Members of the general public have neither of these advantages; for them the white search box is an almost insurmountable obstacle, and as a result the collection remains unused.
The aim of the "Explore the Stacks" project is to provide alternative access methods that open up the collections to the general public. In the transition from the physical to the digital representation, the ability to experience a collection by browsing through the shelves or stacks has been lost. This browsing experience offers significant benefits for the novice user. First, it does not require any knowledge of what is in the collection to be able to find something. Second, it allows users to find things that they were not looking for, but which they serendipitously discover while browsing or looking for something else.
To enable a similar browsing experience in the digital world, the million image data-set has been processed and the shelfmark for each book extracted. The books were then sorted by shelfmark and grouped into sets of between 150 and 200 books (a shelf). The shelves were then further aggregated into sets of 50 shelves (a corridor), creating a total of 4 corridors. This creates a hierarchical structure that organises the books by their shelfmarks and that can then be exploited to create a browseable interface.
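The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, shelfmarks are assumed to sort correctly as plain strings, and a fixed shelf size of 175 stands in for the 150 to 200 range mentioned in the text.

```python
from typing import List


def chunk(items: list, size: int) -> List[list]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_hierarchy(shelfmarks: List[str],
                    shelf_size: int = 175,     # assumption: fixed size within 150-200
                    corridor_size: int = 50) -> List[List[List[str]]]:
    """Return corridors -> shelves -> books, ordered by shelfmark."""
    ordered = sorted(shelfmarks)               # assumption: plain string ordering
    shelves = chunk(ordered, shelf_size)       # a shelf is a run of adjacent books
    corridors = chunk(shelves, corridor_size)  # a corridor is a run of adjacent shelves
    return corridors
```

Because both levels are built from consecutive runs of the sorted list, neighbouring shelves hold neighbouring shelfmarks, which is what makes moving from one shelf to the next meaningful.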
Using this hierarchical structure, an interface has been created that allows the user to explore the collection without needing to use any search terms. Due to the ordering of the shelfmarks, it is possible not only to drill down into the collection, going from corridor to shelf to books, but also to move from one shelf to the next. The user can also enter a search term into the search box. Rather than showing the matching books as a list, the interface highlights the matching corridors, shelves, and books. This way the user can see what is available, but the opportunity for serendipitous discovery is maintained.
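To illustrate what "highlighting" means across the three levels, the sketch below marks every shelf and corridor that contains at least one matching book. It is only an illustration of the intended result under assumed data structures (nested lists of book identifiers, a hypothetical `highlight` function); the project itself obtains the matches from a search index rather than by walking the hierarchy like this.

```python
from typing import Dict, Iterable, List, Set


def highlight(corridors: List[List[List[str]]],
              matching_books: Iterable[str]) -> Dict[str, Set]:
    """Mark books, shelves and corridors that contain a search match."""
    matches = set(matching_books)
    marked: Dict[str, Set] = {"corridors": set(), "shelves": set(), "books": set()}
    for c_idx, corridor in enumerate(corridors):
        for s_idx, shelf in enumerate(corridor):
            hits = matches.intersection(shelf)
            if hits:
                # A match in a book lights up its shelf and its corridor too.
                marked["corridors"].add(c_idx)
                marked["shelves"].add((c_idx, s_idx))
                marked["books"].update(hits)
    return marked
```

Everything that is not marked stays visible in the interface, so the user still sees the surrounding shelves rather than a filtered result list.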
URL for Entry: http://ir.computing.edgehill.ac.uk/apps/explore-the-stacks

Email: Mark.Hall@edgehill.ac.uk

Twitter: hallicek

Job Title: Senior Lecturer in Computing

Background of Submitter:

Mark Hall is a senior lecturer in the Information Retrieval research group at Edge Hill University, with a research focus on human-computer interfaces for overviewing large collections of data and on the evaluation of interactive information retrieval systems. In previous work at the University of Sheffield, the UK National Archives, Cardiff University, and the Alps-Adriatic University of Klagenfurt he has also investigated web-search sessions, geo-referencing of historical texts, modelling people’s uses of vague spatial language, and ontology alignment questions. He has worked as a software developer, developing web-based applications ranging from simple data visualisations to online course management systems.
http://www.edgehill.ac.uk/computing/mark-hall/

Problem / Challenge Space:

Digitisation has created massive digital collections of cultural heritage data and artefacts. The problem is that the massive scale of available data prevents members of the general public from accessing the data, as they do not have the specialist skills necessary for successfully retrieving data they are interested in from the collections. This project addresses this challenge of opening up collections to the general public by providing an alternative browsing interface.

Approach / Methodology:

The project combines different methods. Initially text processing was applied to extract the shelfmarks, which were then aggregated into the following hierarchical structure:
  • everything
  • corridors, each consisting of 50 shelves
  • shelves, each consisting of 150 - 200 books
  • books
For each shelf, the contents of all its books were aggregated and set as the shelf's content. Then the shelves' contents were aggregated to create each corridor's content. The books', shelves', and corridors' content was then indexed using ElasticSearch to support the search functionality.
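The bottom-up aggregation could look something like the sketch below. The names (`aggregate_content`, `book_text`) and the in-memory dictionaries are assumptions made for illustration; the actual indexing call to ElasticSearch is deliberately left out, and each resulting text would simply be submitted as one document.

```python
from typing import Dict, List, Tuple


def aggregate_content(corridors: List[List[List[str]]],
                      book_text: Dict[str, str]
                      ) -> Tuple[Dict[Tuple[int, int], str], Dict[int, str]]:
    """Build one text document per shelf and per corridor from book texts."""
    shelf_docs: Dict[Tuple[int, int], str] = {}
    corridor_docs: Dict[int, str] = {}
    for c_idx, corridor in enumerate(corridors):
        parts = []
        for s_idx, shelf in enumerate(corridor):
            # A shelf's content is the text of all of its books.
            shelf_text = " ".join(book_text.get(book, "") for book in shelf)
            shelf_docs[(c_idx, s_idx)] = shelf_text
            parts.append(shelf_text)
        # A corridor's content is the content of all of its shelves.
        corridor_docs[c_idx] = " ".join(parts)
    return shelf_docs, corridor_docs
```

Since every word of a book ends up in its shelf's and corridor's document, any term that matches a book also matches the containers above it.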
Additionally, Term Frequency - Inverse Document Frequency (TF-IDF) was used to calculate the most salient keywords for each shelf and corridor. These keywords are shown in the interface to give an indication of the kind of topics contained in that shelf or corridor.
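As a rough illustration of the keyword step, the sketch below scores each term in a shelf (or corridor) document by its relative frequency multiplied by the logarithm of the inverse document frequency, and keeps the top-scoring terms. The whitespace tokenisation, the exact weighting variant, and the function name `top_keywords` are assumptions; the text does not specify which TF-IDF formulation was used.

```python
import math
from collections import Counter
from typing import Dict, List


def top_keywords(docs: Dict[str, str], k: int = 3) -> Dict[str, List[str]]:
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))

    keywords: Dict[str, List[str]] = {}
    for name, tokens in tokenised.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        keywords[name] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return keywords
```

Terms that occur on every shelf receive a score of zero under this weighting, so the keywords shown for a shelf tend to be the ones that distinguish it from its neighbours.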

Extent of showcasing BL Digital Content:

The project is based solely on the million images release on Flickr and the matching meta-data release on GitHub.

Impact of Project:

The project's work has been published and presented at the Digital Libraries conference 2014:
M. M. Hall, “Explore the Stacks: a System for Exploration in Large Digital Libraries,” in 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014.

Issues / Challenges faced during project(s):

The biggest challenge in the development was linking the database, which contains the hierarchy and books used to create the browsing interface, with the ElasticSearch index that powers the search. In particular, highlighting which corridors and shelves contain a given search term was complicated. In the end, each corridor and shelf was indexed using the full text content of all the books contained within it. This guarantees that if a book is on a shelf or in a corridor, then searching for that book's content will also result in the shelf or corridor being found and highlighted.