Andrew Gustar

Submitted Entry for 2014 Competition

Abstract

This proposal is for a ‘big data’ analysis of the usage of the BL music collections, combining data from the printed music catalogue, book ordering data (i.e. which music items were requested in the Reading Rooms), and anonymised Reader records. This will shed light on questions such as:

• How much of the music collections is ever / never looked at?

• What are the characteristics of the items that are most / least popular (for example in terms of composers, region, period, publication date, format, or musical genre)?

• What are the characteristics of the Readers who consult different types of music (by age, sex, location, level of activity, etc)?

• How have the answers to these questions changed over time (depending on how many years Book Ordering data is available)?

The analytical approach will depend on the nature of the data, but will include various methods for finding patterns and clusters in ‘Big Data’, and for visualising the data using network analysis, scatter plots and other appropriate techniques. Visualisations will be an important aspect of the output of the work.

It may be possible to extend the analysis to include data relating to inter-library loans or the accessing of digital content. It might also be possible to extend the analysis to cover other datasets such as the Music Manuscripts catalogue or the BL Sound Archive.

This analysis will provide a deeper understanding of the usage of the music collections and will thus inform decisions such as how the collections are organised, which items might be most usefully digitised in future, and how the BL’s music collections might be most effectively showcased. It will provide a valuable opportunity to test the use of Book Ordering Data and Reader Records (including addressing issues such as ensuring Reader anonymity). A by-product will be some analysis of data quality issues in the music collections (inconsistency and duplication), and some recommendations on how these might be improved. The methodology and conclusions could be usefully applied to other collections, or used by other libraries, and it is proposed that they be submitted for publication in an appropriate journal after completion of this project.

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

What sort of items in the BL music collections are consulted most and least frequently? Which groups of Readers consult different types of music item? How has this changed over the period for which Book Ordering data is available?

Please explain the ways your idea will showcase British Library digital collections


The scale and diversity of the BL’s printed music catalogue will be displayed by this analysis. By investigating usage patterns and relating them to Readers’ characteristics, the project will quantify the important role of the BL’s collections in musicological research. By highlighting both the very popular and very obscure parts of the collection, and by providing a better understanding of the needs and interests of Readers, the results of the analysis will inform decisions about which parts of the collection might be most usefully digitised in future, and how they might be presented most effectively.

Depending on the availability of data, it may be possible to incorporate the Music Manuscripts collection within the analysis.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

The project will be a one-off exercise in data mining, statistical modelling and visualisation using three linked datasets relating to the printed music catalogue, book ordering data, and anonymised Reader records. The analysis will involve various forms of linear or non-linear regression, cluster analysis, network analysis, and other techniques depending on the nature of the data and the questions that emerge. Data visualisation will be an important element in the analysis and, in particular, in communicating and presenting the results.


Although the datasets are large, the analysis is conceptually straightforward, and can largely be done using three ‘flat’ data files (i.e. just rows and columns) extracted from the music catalogue, the reader records and the book ordering data, with suitable means of linking them together (anonymised Reader numbers and catalogue Record IDs).
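As an illustration of how three flat files might be linked, the following is a minimal sketch only. It uses Python purely for illustration (the tools named elsewhere in this proposal are Excel and R), and every column name and data value in it (record_id, reader_code, age_band and so on) is invented, since the actual layout of the BL extracts is not yet known.

```python
import csv
import io
from collections import Counter

# Hypothetical flat-file extracts. All column names and values are invented
# for illustration; the real BL files may be structured quite differently.
CATALOGUE_CSV = """record_id,composer,pub_year
R1,Haydn,1795
R2,Smyth,1902
R3,Anon,1850
"""
ORDERS_CSV = """order_date,record_id,reader_code
2013-01-05,R1,A17
2013-02-11,R1,B42
2013-02-12,R2,A17
"""
READERS_CSV = """reader_code,age_band,region
A17,30-34,London
B42,60-64,Scotland
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Index the two lookup tables by their linking keys.
catalogue = {row["record_id"]: row for row in read_rows(CATALOGUE_CSV)}
readers = {row["reader_code"]: row for row in read_rows(READERS_CSV)}
orders = read_rows(ORDERS_CSV)

# Join each order to its catalogue record and its (anonymised) reader record,
# skipping any order whose keys cannot be matched.
linked = [
    {**order, **catalogue[order["record_id"]], **readers[order["reader_code"]]}
    for order in orders
    if order["record_id"] in catalogue and order["reader_code"] in readers
]

# Usage counts per item, and the share of the catalogue never requested.
requests_per_item = Counter(row["record_id"] for row in linked)
never_requested = sorted(set(catalogue) - set(requests_per_item))
share_never_requested = len(never_requested) / len(catalogue)
```

Once the files are linked in this way, each row carries the item, the date and the broad Reader category together, so questions such as how much of the collection is never requested reduce to simple counts.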

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

I have recently completed a PhD in the application of statistical techniques to the study of music history. A copy of my thesis is available at http://oro.open.ac.uk/41851/ . My doctoral research included a detailed assessment of the characteristics of library catalogue data as a source for statistical analysis. More recently I have been pursuing a ‘big data’ analysis of the population of symphonies using a rich dataset of about 7,500 works that I have scraped, merged and cleaned from a variety of sources. Examples of some of my charts and visualisations from this work are available on request. I attended the ‘Big Data History of Music’ data exploration day on 10 March at the British Library, and was awarded a prize for the day’s best visualisation, which was an analysis of publication histories using the BL Music Catalogue data.


I have MAs in both mathematics (Cambridge, 1989) and music (Open, 2009), and am a Fellow of the Institute of Actuaries, so have considerable experience of statistics and mathematical modelling as well as a thorough understanding of musicology. I am proficient in the use of computer software to analyse data. My tools of choice are Excel and R, but I have also used Gephi, Open Refine, Tableau Desktop, and Google Earth / KML.
I am a freelance researcher and there are no calls on my time which would restrict my ability to work on this project.

A list of publications, and further details of my CV, can be found on my LinkedIn page https://www.linkedin.com/in/andrewgustar.

A book on the use of statistics in music history is in preparation, in which I would wish to include some material from this project.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical
The analysis itself is straightforward using tools and techniques with which I am familiar. The main technical issues will be at the start and end of the project: obtaining the data in a suitable form, and presenting the conclusions.

The BL music catalogue is readily available in a convenient form following the ‘Big Data History of Music’ project. I understand that Reader data must be anonymised and summarised in ‘big buckets’, which should be easily achievable without reducing the usefulness of the results. The Book Ordering Data requirement is for a file of dates, item numbers, and (anonymised) reader numbers, either for all items or just for those ordered for the Rare Books and Music Reading Room. I discussed these requirements with Mahendra Mahey on 10 April but have not yet (as at the date of submission of this form) received confirmation that they can be provided.

The Music Catalogue itself presents various difficulties of inconsistency and duplication, with which I am familiar from my doctoral research. This project will quantify the extent of duplication and inconsistency in the catalogue, and may recommend how these issues can be addressed.
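One simple way of estimating the extent of duplication is to group records whose composer and title are identical after normalisation. The sketch below is illustrative only, written in Python, with invented records; the normalisation rule (strip accents and punctuation, ignore case) is an assumption and would miss many real-world variants, so it gives at best a lower bound.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


# Invented example records: (record ID, composer, title).
records = [
    ("R1", "Dvořák, Antonín", "Symphony no. 9"),
    ("R2", "Dvorak, Antonin", "Symphony No.9"),
    ("R3", "Smyth, Ethel", "The Wreckers"),
]

# Group record IDs by their normalised (composer, title) key.
groups = defaultdict(list)
for record_id, composer, title in records:
    groups[(normalise(composer), normalise(title))].append(record_id)

# Groups with more than one member are candidate duplicates for manual review.
candidate_groups = sorted(ids for ids in groups.values() if len(ids) > 1)

# Surplus records (beyond the first in each group) as a share of all records.
surplus = sum(len(ids) - 1 for ids in candidate_groups)
duplicate_rate = surplus / len(records)
```

Records sharing a key are only candidates: different editions of the same work would also match, so a sample of the groups would still need checking by hand.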

A particular problem with the catalogue data is the rather patchy information regarding musical genre and instrumentation. This information can be obtained by manually sampling a few hundred records from relevant clusters emerging from the analysis. It might also be possible to use machine learning and text mining techniques to automate high level genre attribution for the entire music catalogue, for example based on the occurrence of combinations of key words in the titles of works.
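The keyword idea can be sketched very simply, as below. This is a rule-based illustration in Python rather than a machine learning model, and the genre labels and keyword combinations are invented for the example; a real rule set would need to be built and validated against the manually sampled records.

```python
import re

# Hypothetical rules: a genre is attributed when ALL of the listed keywords
# occur in the title. Labels and keywords are invented for illustration.
GENRE_RULES = [
    ("chamber", {"string", "quartet"}),
    ("orchestral", {"symphony"}),
    ("vocal", {"song"}),
    ("vocal", {"voice", "piano"}),
]


def guess_genres(title):
    """Return the sorted genre labels whose keyword combination occurs in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted({label for label, required in GENRE_RULES if required <= words})


examples = {
    title: guess_genres(title)
    for title in [
        "String Quartet in F",
        "Three pieces for Voice and Piano",
        "Symphony no. 5",
        "Untitled",
    ]
}
```

Titles that match no rule are left unattributed rather than guessed, so the proportion of unmatched titles gives a direct measure of how much of the catalogue such rules fail to cover.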

Although I have some experience of producing visualisations of data analyses such as this, I would wish to use the experience and expertise of the BL Labs team to help in presenting the outputs of this project in a creative and engaging way.

Curatorial
I understand that relatively little use has been made of the Book Ordering data and Reader records, and this project would present an opportunity to test the accessibility and usability of this data, to address the requirements about anonymity in using Reader data, and to illustrate some of the ways in which these datasets could be constructively used in showcasing both the BL’s collections and its broader role.

Reader anonymity is of paramount importance and a suitable approach will be agreed with the owners of the data. I would address the need for anonymity with a three-stage process: first, by anonymising all of the reader numbers (replacing each one with a different code, in both the Reader and Book Ordering datasets); second, by aggregating demographic information into broad categories (e.g. attributing dates of birth to five-year bands, and reducing addresses to post-towns or larger regions); and third, by ensuring that none of the outputs of the project mentions anything that might allow an individual Reader to be identified (which might occur if, for example, only one Reader has ordered copies of a particular work that is highlighted for special attention in the report).
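The three stages could be sketched along the following lines. This is an illustration in Python under stated assumptions, not an agreed procedure: the function names, the use of random hexadecimal codes, and the minimum of three Readers before an item may be highlighted are all hypothetical choices that would need to be settled with the owners of the data.

```python
import secrets
from collections import defaultdict


def make_pseudonyms(reader_numbers):
    """Stage 1: give each reader number a distinct random code.

    The mapping itself would remain with the data owner and never be released.
    """
    mapping = {}
    used = set()
    for number in reader_numbers:
        if number in mapping:
            continue
        code = secrets.token_hex(8)
        while code in used:  # regenerate in the unlikely event of a clash
            code = secrets.token_hex(8)
        used.add(code)
        mapping[number] = code
    return mapping


def birth_band(year, width=5):
    """Stage 2: reduce a year of birth to a band, e.g. 1987 -> '1985-1989'."""
    start = year - year % width
    return f"{start}-{start + width - 1}"


def items_safe_to_highlight(orders, min_readers=3):
    """Stage 3: keep only items requested by at least min_readers distinct Readers.

    The threshold of 3 is an assumed value for illustration.
    """
    readers_per_item = defaultdict(set)
    for record_id, reader_code in orders:
        readers_per_item[record_id].add(reader_code)
    return {item for item, codes in readers_per_item.items() if len(codes) >= min_readers}


# Invented example data: (record ID, original reader number).
raw_orders = [("R1", "111"), ("R1", "222"), ("R1", "333"), ("R2", "111")]

pseudonyms = make_pseudonyms(number for _, number in raw_orders)
orders = [(record_id, pseudonyms[number]) for record_id, number in raw_orders]
highlightable = items_safe_to_highlight(orders)
```

Because the same mapping is applied to both the Reader and Book Ordering files, the two remain linkable after the original reader numbers have been removed.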

Legal
I would work closely with BL staff to ensure that all legal issues are addressed. These will relate primarily to Data Protection and the protection of Intellectual Property. The project will only consider large scale patterns in the data, so none of the intellectual property relating to individual records will be infringed. The BL Music Catalogue data is already in the public domain. Reader records will be anonymised as described above.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between June 2015 - October 2015.


June 2015

• Working with the BL Labs team to ascertain what data is available, determining exactly what is needed, and extracting files from the relevant systems. Ensure data is anonymised as appropriate. (2-3 days, working with somebody at BL Labs)

• Data linking, cleaning, and initial high level analysis and scoping. (2-3 days, at home)

• On holiday from mid-June to beginning of July.


July 2015

• Detailed examination of overall nature and shape of data, descriptive statistics and high level visualisations to be able to identify areas of interest and plan the analytical approach. (5 days, at home)

• Subject to the above, begin work on regression, clustering and network analysis (5 days, at home)

• Research analytical tools and visualisation techniques to be best able to present the sort of results that are emerging. (2-3 days, at home)

• Progress report with initial conclusions and visualisations, and discussion with BL Labs team to agree areas of interest and next steps (1 day, at BL)


August 2015

• Detailed analysis as agreed above. (10-15 days, at home)

• Sampling and manual checking of records in order to obtain genre information (5-10 days, at home)

• Discussion with BL Labs team about visualisation approach and appropriate tools (1 day, at BL)

• Progress report and agree next steps with BL Labs (1 day, at BL or via email/phone)

• One week holiday


September 2015

• Continuation of analysis from August (2-5 days, at home)

• Create initial visualisations (3-5 days, at home)

• Prepare presentation to judging panel and discuss/refine it with BL Labs (perhaps 5 days at home plus 2 at the BL)


October 2015

• Working on final report and presentation. Refine and produce final versions of visualisations. Link to contextual data where relevant (e.g. digital resources, sound files, other sources). (5-10 days, at home or in BL, some liaison with BL Labs team)

Subsequently

• Work with someone at BL Labs on a paper for a suitable academic journal (15-25 days)