
Pieter Francois

Winner's Blog Post

Submitted Entry for 2013 Competition

Abstract

The ‘Sample Generator for Digitized Texts’ is a relatively simple piece of software which connects one or more major catalogues or bibliographies with one or more collections of digitized texts through the metadata. The ‘Sample Generator’ allows users to create custom-made samples of fully digitized texts that mirror the distribution of certain key parameters, like genre, year and place of publication, language, gender of the author, ..., as found within the catalogues and bibliographies. Whereas the applicability of the ‘Sample Generator’ is universal, this proposal, given the tight framework of this competition, will focus on testing out this novel approach by connecting the nineteenth-century holdings of the Integrated Catalogue of the British Library with the ‘19th Century Books’ digital collection. Modifying the baseline of the BL Labs, the main aim of this entry is therefore to tell the story of over a million nineteenth-century books through a structured sampling of 68,000 books.

Assessment Criteria

The research question / problem you are trying to answer*

Please focus on the clarity and quality of the research question / problem posed:

Context
How representative are the historical texts humanities scholars study of the overall body of ‘surviving’ texts that are held in the various library collections? Humanities scholars have devised a range of valuable methodologies to tackle the fundamental yet thorny issue of the representativeness of their sources. Most of these methodologies inevitably hinge on a series of valid practical considerations. These include the amount of research time a researcher can spend on the study of the sources, the proximity of the researcher to the major relevant libraries and collections and, increasingly, the level of access researchers have to digital collections. The ‘digital age’ offers ample possibilities to add substantially to this existing range of methodologies. The availability of large digital collections makes it possible to tackle issues of representativeness in a methodologically coherent and robust manner. I firmly believe that the introduction of advanced sampling techniques will play a crucial role in shaping these new methodologies. The novel concept of a ‘Sample Generator for Digitized Texts’ (or ‘Sample Generator’), as introduced in this proposal, has the potential to play a crucial role in putting these methodologies on track as it invites humanities scholars to think about representativeness and sampling in a fascinating new way.

General idea
The ‘Sample Generator’ is a relatively simple piece of software which connects one or more major catalogues or bibliographies with one or more collections of digitized texts through the metadata. The ‘Sample Generator’ allows users to create custom-made samples of fully digitized texts that mirror the distribution of certain key parameters, like genre, year and place of publication, language, gender of the author, ..., as found within the catalogues and bibliographies. Whereas the applicability of the ‘Sample Generator’ is universal, this proposal, given the tight framework of this competition, will focus on testing out this novel approach by connecting the nineteenth-century holdings of the Integrated Catalogue of the British Library with the ‘19th Century Books’ digital collection. Modifying the baseline of the BL Labs, the main aim of this entry is therefore to tell the story of over a million nineteenth-century books through a structured sampling of 68,000 books.

Description

The use of the ‘Sample Generator’ opens up an additional source base upon which humanities scholars can base their claims and arguments. It will be possible to anchor long-term trends and changes over time that are manifest in the digital collections to much larger collections and bibliographies, and to do so in a well-documented and methodologically coherent way. These larger collections and bibliographies are usually much better proxies for the overall publication landscape. Adding this second source base allows users both to ask novel research questions and to answer existing research questions with greater confidence. For example, users of the ‘Sample Generator’ can create a sample of fully digitized books that reflects the change over time in the yearly number of publications as is evident in the entire nineteenth-century holdings of the Integrated Catalogue of the British Library. In addition it will be possible to further customize this sample and have the sample reflect for every year of the nineteenth century the ratio between books published in ‘London’, ‘Britain, but excluding London’ and ‘not Britain’. Further custom-made sampling layers regarding genre, language of publication, gender of the author, publisher, ... can be added if relevant for the research questions the user intends to answer.
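The sampling idea described above can be illustrated with a minimal sketch of proportional stratified sampling. It is not the implementation of the ‘Sample Generator’: the function name `stratified_sample`, the record fields `year`, `place` and `id`, and all counts are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(catalogue, digitized, key, size, seed=0):
    """Pick roughly `size` digitized records that mirror the catalogue's mix of strata."""
    rng = random.Random(seed)  # fixed seed so that a sample can be reproduced

    # Share of each stratum (e.g. year + place) in the larger catalogue.
    catalogue_counts = Counter(key(record) for record in catalogue)
    total = sum(catalogue_counts.values())

    # Digitized records grouped by the same stratum key.
    pool = defaultdict(list)
    for record in digitized:
        pool[key(record)].append(record)

    sample = []
    for stratum, count in catalogue_counts.items():
        # Rounding means the final sample can differ slightly from `size`.
        quota = round(size * count / total)
        available = pool.get(stratum, [])
        if quota > len(available):
            raise ValueError(f"not enough digitized books for stratum {stratum!r}")
        sample.extend(rng.sample(available, quota))
    return sample


def year_and_place(record):
    """Hypothetical stratum key: year of publication plus place category."""
    return (record["year"], record["place"])


# Invented toy data: 100 catalogue entries and 60 digitized books for one year.
catalogue = (
    [{"year": 1850, "place": "London"}] * 60
    + [{"year": 1850, "place": "Britain, but excluding London"}] * 30
    + [{"year": 1850, "place": "not Britain"}] * 10
)
digitized = [
    {"id": i, "year": 1850, "place": place}
    for i, place in enumerate(
        ["London", "Britain, but excluding London", "not Britain"] * 20
    )
]

sample = stratified_sample(catalogue, digitized, year_and_place, size=10)
print(Counter(record["place"] for record in sample))
```

Additional sampling layers such as genre or language would, under these assumptions, simply extend the tuple returned by the key function.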

Finally, and most crucially, it will be possible to choose the size of the sample. For example, the fullest possible sample which reflects for every year of the nineteenth century the gender ratio of authors as is evident in the Integrated Catalogue will most likely use the vast majority of the 68,000 digitized books of the ‘19th Century Books’ collection. The loss of a few thousand (or possibly even only a few hundred) digitized texts when creating this largest possible sample would be more than compensated for by the possibility of linking the research findings to one of the world’s largest collections of nineteenth-century books. Such a custom-made adjustment of the sample size (or scaling of the sample) also offers users an important conceptual and methodological framework to integrate different research styles within one study. Adding a scaling functionality to the basic model of the ‘Sample Generator’ contributes considerably to the potential of the ‘Sample Generator’. Very large sample sizes invite the use of ‘distant reading’ techniques and tools, whilst it is possible to study smaller samples through ‘close reading’. By creating multiple samples that have different sizes yet follow the same set of parameters or values for genre, year of publication, ..., it is possible to address the same set of research questions at different sample levels and thus to integrate within one study complementary methodologies. Using the ‘Sample Generator’ therefore does not exclude the use of any other research methodology. In fact, the potential of the ‘Sample Generator’ is strongest when combined with other research tools, such as advanced text mining, GIS and network analysis tools.
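The notion of a ‘largest possible sample’ can be made concrete with a small worked sketch. Assuming each stratum has a target share taken from the catalogue and a limited number of digitized books, the largest sample that still respects every share is bounded by the scarcest stratum. The function names `max_sample_size` and `quotas` and all counts below are hypothetical, not figures from the British Library collections.

```python
import math
from fractions import Fraction


def max_sample_size(catalogue_counts, digitized_counts):
    """Largest sample size whose per-stratum quotas never exceed the digitized supply."""
    total = sum(catalogue_counts.values())
    limits = []
    for stratum, count in catalogue_counts.items():
        if count == 0:
            continue  # a stratum absent from the catalogue imposes no limit
        share = Fraction(count, total)           # target share from the catalogue
        available = digitized_counts.get(stratum, 0)
        limits.append(available / share)         # size at which this stratum runs out
    return math.floor(min(limits))


def quotas(catalogue_counts, size):
    """Number of books to draw per stratum for a sample of the given size."""
    total = sum(catalogue_counts.values())
    return {
        stratum: math.floor(Fraction(size * count, total))
        for stratum, count in catalogue_counts.items()
    }


# Invented counts for a single year: catalogue entries and digitized books by author gender.
catalogue_counts = {"male": 800, "female": 200}
digitized_counts = {"male": 90, "female": 10}

largest = max_sample_size(catalogue_counts, digitized_counts)
print(largest)                            # 50
print(quotas(catalogue_counts, largest))  # {'male': 40, 'female': 10}
print(quotas(catalogue_counts, 10))       # {'male': 8, 'female': 2}
```

In this toy case half of the 100 digitized books are left out of the largest sample because female-authored books are the binding constraint; smaller samples for ‘close reading’ reuse the same shares at a reduced size.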

Case-study

Convincing large groups of humanities scholars of the potential of the ‘Sample Generator’ and its methodological robustness will be key to the viability of this project in the long term. In my opinion this important task of convincing humanities scholars can only be achieved by testing out the ‘Sample Generator’, and its different features and functionalities, with real data. In addition to creating a basic model of a ‘Sample Generator’, which is technically a fairly straightforward task, most time will be spent on using the ‘Sample Generator’ to revisit some important historiographical debates on the history of nineteenth-century travel. Given my own background in this field and the strong focus of the ‘19th Century Books’ collection on travel, the aim is to highlight the potential of the ‘Sample Generator’ by addressing some well-known debates through testing a series of relevant predictions. For example, opinions differ on whether the publication of the first Murray travel guide resulted in a drop in the number of published travel guides or whether the introduction of the railway resulted in travel guides promoting more standardized travel routes. Specific arguments related to these and other topics are usually supported by enlisting a range of quotations from different travel guides. As such strategies are prone to the criticism of cherry picking, these debates are rarely settled. The use of the ‘Sample Generator’ would allow scholars to weigh in on these and similar debates in a more authoritative and methodologically coherent way. Furthermore this case-study of nineteenth-century travel will generate plenty of examples of good practice on how to use the ‘Sample Generator’ in a “sensible” way. This is especially important as I firmly believe that the ‘Sample Generator’ will only be as strong and convincing as the set of instructions that will accompany it. It is therefore essential to put together a well-documented and rich ‘user manual’.

Please explain the ways your idea will showcase British Library digital collections*

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

The core digital collection the ‘Sample Generator’ is built around is the ‘19th Century Books’ collection which contains over 68,000 fully digitized books. As outlined above, the use of the ‘Sample Generator’ will enhance considerably the potential of this already fascinating collection. Furthermore the ‘Sample Generator’ also makes use of the nineteenth century holdings of the Integrated Catalogue as it is the metadata of the nineteenth century holdings that will structure the samples. Finally, for cross-validation purposes, it would be great if the ‘Sample Generator’ could also make use of the ‘Bibliography of British Travel Writing, 1780-1840’, which is put together by Benjamin Colbert.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved*

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

This entry makes use of a range of research methods. Firstly, an API has to be created which connects all the metadata and has a querying functionality similar to that in a relational database environment. This API also needs a sampling functionality. Secondly, testing the functionality of the ‘Sample Generator’ with data involves performing some statistical analyses for which either the R or the Matlab environment will be used. Thirdly, as the analysis of travel data has an important spatial dimension, this entry makes use of specialised GIS software (potentially ArcGIS, although either the R or the Matlab environment can also be used). Fourthly, for the analysis of network data, like the travel route data, network analysis tools like Gephi will be used. The extent to which the latter two research methods will be used depends heavily on the extent to which the travel case study can be developed. As this represents a major time commitment at short notice, the extent to which the case-study can be developed will depend largely on the level of support the BL Labs can offer (see below).

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team*

E.g. work you may have done, publications, a list with dates and links (if you have them)

For this project I combine both strands of my academic background. Firstly, I trained as a historian of nineteenth century travel (PhD in 2006, Royal Holloway, University of London). I have published extensively on the topic of nineteenth century British travellers to the Continent. These publications include a monograph ‘’A Little Britain on the Continent’. British Perceptions of Belgium 1830-1870’ (Pisa University Press, 2010) and several articles of which ‘If It’s 1815, This Must Be Belgium: The Origins of the Modern Travel Guide’ (Book History, 2012) and ‘’The Best way to See Waterloo is with Your Eyes Shut’. British ‘Histourism’, Authenticity and Commercialization in the Mid-Nineteenth-Century’ (Anthropological Journal of European Cultures, forthcoming) are most representative of my research interests.
My current job is in the second academic field that I am interested in, i.e. evolutionary and cognitive anthropology. I am above all interested in devising novel ways to open up (often gappy) historical data for sustained statistical and spatial analysis and to use these historical datasets to test and inform cognitive and evolutionary theories. As a result of the expertise I have built up in this emerging field at the intersection of the Social Sciences and the Humanities, I am currently involved in three multi-centre collaborative projects, including the ‘Seshat. Global History Databank’ project (Oxford University), of which I am a co-founder, the ‘Evolution of the Origin of Morality and Religion’ project (University of British Columbia), and a project focusing on the earth’s historical carrying capacity (University of Connecticut). For a paper on the methodology of the first project, see the following article I co-authored: http://www.escholarship.org/uc/item/2v8119hf# . This entry brings together my interest in data, statistics, and nineteenth century travel. I am also looking forward to bringing the experience I have built up in working in fully collaborative environments (including on issues such as ‘credit’ and authorship) into a humanities environment.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis*

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical

In terms of the technical feasibility of this entry, it is important to separate two very different parts of the ‘Sample Generator’. Designing a basic model of the ‘Sample Generator’ is relatively straightforward, especially as it should be possible to reuse sections of code from a range of APIs. Adding a fully developed scaling functionality to the ‘Sample Generator’ is a very different issue. It is important to be upfront on what is feasible to accomplish in the short time span of this competition. In my opinion it will only be possible to take the crucial first steps towards a full implementation of that feature. Given that the scaling functionality is the most promising, yet fragile, part of the ‘Sample Generator’ it is crucial that these first steps are made early on. There are several possible routes toward implementing the scaling functionality. One route would be to invest all the time in devising a high-end ‘Sample Generator’ which includes the scaling functionality. However, there would be no time left to test the ‘Sample Generator’ with data and most time would be spent tackling some moderately complicated statistical issues (see below). The other route, and the one favoured in this proposal, is to put together a basic model of a ‘Sample Generator’ and test it out in a comprehensive way with data and use this material for a first draft of the all-important ‘user manual’. My preference for this second route is also a strategic choice as the success of the ‘Sample Generator’ will initially hinge on convincing a sufficient number of humanities scholars of its potential. The second route is much more suited to achieve this goal. Too strong and exclusive a focus on statistical issues at the initial stages will in my opinion backfire. For example, in order to limit instances of “misuse” of the ‘Sample Generator’ it will be essential to provide clear instructions on appropriate minimum sample sizes for different types of research questions.
This needs to be fleshed out by performing a series of statistical tests. The sticking point is that there will be a discrepancy between statistically valid cut-off points on the one hand and the opinion of large sections of potential users on the other. Inevitably there will be a grey zone in which data will still be useful when approached through confidence intervals. Yet users might be reluctant to look at data in this way as many academic disciplines have no tradition of doing so. (To be more precise, this is inherent to any form of sampling, no matter the size of the sample. However when simply “eyeballing” or “visualizing” the data in a straightforward manner the use of larger sample sizes creates a false sense of accuracy and in some humanities disciplines this reinforces the tradition to gloss over the need to understand data through statistics). Whereas this and other technical discussions are vital for the viability of the ‘Sample Generator’ in the long term, I fear that too strong a focus on these relatively time-consuming issues will be detrimental to our effort to create a platform of support for the ‘Sample Generator’ among humanities scholars. Yet, once the potential of the ‘Sample Generator’ is clearly demonstrated with the help of (seemingly) straightforward data from the travel case study, it will be much easier to deal with the statistical issues.
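To illustrate what approaching the data ‘through confidence intervals’ could look like, the following sketch computes a Wilson score interval for a proportion observed in a sample. The choice of the Wilson interval is an assumption made here for illustration; the proposal does not prescribe a specific statistical method, and the counts are invented.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Invented example: 30% of sampled travel guides share some feature of interest.
small = wilson_interval(30, 100)    # sample of 100 books
large = wilson_interval(300, 1000)  # same share in a sample of 1,000 books

print(small)
print(large)
```

The same observed share is compatible with a much wider range of underlying values in the smaller sample, which is the kind of guidance on minimum sample sizes a ‘user manual’ would need to spell out.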

Curatorial

The ‘Sample Generator’ would make use of in-house data of the British Library. As mentioned above, the ‘Sample Generator’ links the metadata of the nineteenth-century holdings from the Integrated Catalogue with the metadata from the ‘19th Century Books’ collection. Linking these two sets of metadata is fairly straightforward as both sets of metadata contain a unique Aleph number for each book (although inevitably minor issues relating to multiple editions and multi-authored works will have to be ironed out). The metadata of the Integrated Catalogue holdings is accessible in two ways. Firstly, it is possible to access the metadata using an existing API (on site only). Secondly, the BL Labs offer the use of a local dump of the metadata, again for on-site use only. The latter option is by far the most useful one during the initial try-out phases. For the final version of the ‘Sample Generator’ it is advisable to use the existing API. The metadata of the ‘19th Century Books’ collection is already in my possession in the format of an Excel spreadsheet. Although connecting this set of metadata efficiently with the corresponding OCR files needs to be further streamlined, it is clear that this will be resolved by August when the OCR content should first be accessed.
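The linking step described above amounts to a join on the shared Aleph number. The sketch below shows the idea on invented rows; the field name `aleph`, the other column names and the function `link_by_aleph` are hypothetical, and the real exports may be structured differently.

```python
def link_by_aleph(catalogue_rows, digitized_rows, field="aleph"):
    """Join digitized-book rows to catalogue rows on a shared identifier."""
    # Index the catalogue by identifier; later duplicates overwrite earlier ones,
    # so cases such as multiple editions would need separate handling.
    catalogue = {row[field].strip(): row for row in catalogue_rows}

    linked, unmatched = [], []
    for row in digitized_rows:
        match = catalogue.get(row[field].strip())
        if match is None:
            unmatched.append(row)            # keep for manual inspection
        else:
            linked.append({**match, **row})  # merge both sets of metadata
    return linked, unmatched


# Invented rows standing in for the catalogue dump and the spreadsheet.
catalogue_rows = [
    {"aleph": "000000001", "year": "1836", "place": "London"},
    {"aleph": "000000002", "year": "1851", "place": "Edinburgh"},
]
digitized_rows = [
    {"aleph": "000000001 ", "ocr_file": "book_a.xml"},
    {"aleph": "000000009", "ocr_file": "book_b.xml"},
]

linked, unmatched = link_by_aleph(catalogue_rows, digitized_rows)
print(linked)
print(unmatched)
```

Returning the unmatched rows separately makes it easier to see how many digitized books fall outside the catalogue metadata before any sampling is attempted.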

Legal

The ‘Sample Generator’ only makes use of nineteenth century books that are out of copyright. As mentioned before, for cross-validation purposes it would be great to link up with the ‘Bibliography of British Travel Writing, 1780-1840’ project. Further contact with Benjamin Colbert will be made to explore this possibility.

Please provide a brief plan of how you will implement your project idea by working with the Labs team*

You will be given the opportunity to work on your winning project idea between July 6th - October 31st 2013

There are two important ways in which the BL Labs can play a crucial role in furthering this project. Firstly, and most importantly, I hope the BL Labs can provide me with a supportive environment to let the idea mature. This support can take several forms. I am especially keen to put together a focus group interested in digital collections and sampling. Such a group would not only enrich my own thinking and allow the idea to mature further, it would, hopefully, also put me in touch with potential collaborators who can help shape the next phases of the overall project. Potential future spin-offs include putting together a high-end and statistically sophisticated ‘Sample Generator’ for the ‘19th Century Books’ collection, using a ‘Sample Generator’ on other digital collections (whether in the BL or not), supporting future grant applications, ... . I definitely hope members of the advisory board would be interested in performing such a mentoring role.

In practical and more immediate terms there are numerous ways the BL Labs can support the ‘Sample Generator’.

Firstly, I am hoping that the technical lead of the BL Labs can take care of the IT aspects of creating the basic model of the ‘Sample Generator’. Secondly, the better and more comprehensively the travel case-study is worked out, the easier it will be to highlight the potential of the ‘Sample Generator’. I am especially interested in enlisting the help of a part-time RA (preferably a historian) who, in tandem with myself, can gather the specific travel data that is required to test the hypotheses. In addition, I am also keen to get a limited amount of help with conducting the statistical analysis and creating some of the GIS and network analysis visualizations. Although any support would be greatly appreciated and would enhance considerably the richness of both the case-study and the ‘user manual’, this support is not a necessary condition for the viability of this entry as I can scale down the case-study and do the analyses myself.

Fullest wish list of potential collaborators:

Pieter Francois - applicant (PF)
Mahendra Mahey (MM)
BL Lab Technical Lead (TL)
Curators of the nineteenth-century holdings (Curators)
Digital curators of the ‘19th Century Books’ collection (Digital Curators)
Research Assistant working on travel literature (RA 1)
Research Assistant performing a limited number of statistical tests (RA 2)

Wider BL Lab community (Lab)

July 6 - Onwards
• Discussing and developing the basic idea within the context of the BL Labs (PF, Lab)
• Taking stock of the selection criteria used for creating the ‘19th Century Books’ collection (Digital Curators, PF)
• Discuss further with Benjamin Colbert the possibility to include the ‘Bibliography of British Travel Writing, 1780-1840’ (PF)
• Drafting a full list of predictions related to nineteenth century travel that can be tested by using the ‘Sample Generator’ (PF, potentially with the help of leading travel historians like James Buzard or Marjorie Morgan)
• Getting acquainted with the local dump of the metadata of the Integrated Catalogue (TL, PF and Curators)
• Identifying people for the two research assistant positions (I have two people in mind, but I am flexible) (MM, PF)

August 2013
• Creating a first extremely basic version of the ‘Sample Generator’ (TL, PF, Curators, Digital Curators)
• Start gathering the travel data necessary to test the hypotheses (RA 1, PF)

September 2013
• Gathering the travel data necessary to test the hypotheses (RA 1, PF)
• Perform a set of statistical analyses and create some initial visualizations based upon the first batch of travel data (PF, RA 2)

October 2013
• Finish gathering the travel data necessary to test the hypotheses (RA 1, PF)
• Finalize the statistical analyses and visualizations of the travel data (PF, RA 2)
• Publish an article focusing on the potential of the ‘Sample Generator’ by showcasing its impact on historiographical debates on nineteenth-century travel (PF as first author, RA 1, RA 2, TL, MM, and all the other Lab members who had a significant input)
• Present the result to different audiences. This can be achieved by presenting the work at conferences, seminars, online, in the media ... . (PF, and hopefully all other collaborators)
• Plan the further stages of the project, including discussing
a) how to upgrade the existing basic ‘Sample Generator’
b) putting together a comprehensive ‘user manual’
c) using a ‘Sample Generator’ for other digital collections and catalogues
d) writing further articles (especially a methodological one)
e) applying for future funding (PF, Lab)