Git-Lit: A Project to Parse, Version-Control, and Post British Library Digital Texts (Category: Research)

Name of Submitter(s): Jonathan Reeve

Organisation: Columbia University

The Git-Lit project parses the British Library's ALTO XML documents, places them under version control, and posts them as git repositories. This has a number of effects useful to digital humanities scholarship. First, it allows for open collaborative editing of the texts, in order to crowdsource the correction of OCR errors and encourage the creation of new, decentralized scholarly editions. Second, it provides archival records of all changes through the use of modern version control systems. Third, it facilitates computational text analysis by enabling the automated retrieval and collection of large numbers of disconnected texts. The scripts used by Git-Lit transform the XML metadata for British Library texts into human-readable prefatory documents, version-control these documents along with the original texts, and push the resulting git repositories to GitHub. The resulting text repositories are then available for collaborative editing, using the fork/revise/pull request model long used by software developers.
Future phases of the project will involve programmatically indexing the texts according to date and genre, and collecting these texts into parent repositories using git submodules. Text analysts would then be able to download a corpus of nineteenth-century Bildungsromane, for instance, simply by running two short commands: one to clone a parent repository, and another to recursively initialize its submodules. This would make retrieving these corpora far faster than assembling them by hand, and would allow the texts to be non-linearly curated into special interest categories.
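As a rough illustration of the two-command retrieval described above, the sketch below builds and runs the `git clone` and `git submodule update --init --recursive` commands from Python. The function names are hypothetical, and no parent repository exists yet, so any URL passed in is an assumption.

```python
import subprocess


def corpus_clone_commands(parent_url, target_dir):
    """Return the two git commands that fetch a parent repository and its submodules."""
    return [
        # Command 1: clone the parent repository that indexes the corpus.
        ["git", "clone", parent_url, target_dir],
        # Command 2: recursively initialize and fetch every text submodule.
        ["git", "-C", target_dir, "submodule", "update", "--init", "--recursive"],
    ]


def download_corpus(parent_url, target_dir):
    """Run both commands in order; raises CalledProcessError if git fails."""
    for command in corpus_clone_commands(parent_url, target_dir):
        subprocess.run(command, check=True)
```

Calling `download_corpus` with the URL of a parent repository and a directory name would leave every text in the corpus checked out under that directory, assuming git is installed and the URL is reachable.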
A subsequent phase of the project will automatically generate human-readable ASCIIDOC and richly annotated TEI XML versions of each text. The ASCIIDOC versions will be automatically rendered into web pages by GitHub, increasing exposure for the texts and encouraging collaborative editing; the TEI XML versions will allow for an archival-quality edition of the text that contains semantic markup of the text's literary features.
The scripts for the project are currently operational with the four sample texts provided by James Baker of the British Library, and the output of these scripts can be found at the GitHub organization "Git-Lit." The project is currently awaiting access to the larger corpus of ALTO XML files.
URL for Entry: http://jonathanreeve.github.io/git-lit/

Email: jon.reeve@gmail.com

Twitter: j0_0n

Job Title: Graduate Student, Programmer

Background of Submitter:

I am a first-year graduate student in English at Columbia University, specializing in digital humanities. I have worked for the past two years as a programmer and web developer for the Modern Language Association, where I contributed code to the digital editions of the Literary Research Guide, Literary Studies in the Digital Age, and the forthcoming Digital Pedagogy in the Humanities. My publications include a forthcoming chapter on James Joyce's A Portrait of the Artist as a Young Man in the volume Reading Modernism with Machines. Among my other digital projects are:
- Annotags, a protocol for decentralized literary annotation, which allows users to encode bibliographic information into a hashtag suitable for use in microblogging services such as Twitter. http://jonreeve.com/projects/annotags/about.html
- The Macro-Etymological Analyzer, a text analysis web app that computes the macro-etymological language origin of a given text. http://jonreeve.com/2013/11/introducing-the-macro-etymological-analyzer/
A more complete list is available at http://jonreeve.com/cv/

Problem / Challenge Space:

Roughly, Git-Lit addresses three issues:
1. Electronic texts are difficult to edit. There does not yet exist an efficient, streamlined way to improve the quality of an electronic text. What is needed, therefore, is an open-source, decentralized model for community-centered editing. This model already exists for software development in the form of git. By posting a text to GitHub, we can take advantage of the fork/revise/pull request workflow that programmers have long enjoyed for software collaboration.
2. Textual corpora are difficult to assemble. With some exceptions (notably the NLTK corpus module), downloading a text corpus involves compiling texts from any number of heterogeneous sources. A would-be text analyst must click through a series of web pages to find the corpus he or she wants, and then either download a .zip file that must be expanded, or email the corpus assembler for a copy of the corpus. With multiple texts, this can be a labor-intensive process that is not easily scripted or automated. Git provides an easy way to solve these problems. By making texts available through the git protocol on GitHub, anyone who wishes to download a text corpus can simply run `git clone` followed by the repository URL. Parent repositories can then be assembled for collections of texts using git submodules.
3. ALTO XML is not very human-readable. ALTO XML, the OCR output format used by the British Library, the Library of Congress, and others, is extremely verbose. It encodes the location of each OCRed word, and often gives the OCR certainty for each word. This is great for archival purposes, but isn't an ideal starting-point for the kinds of text analysis typically done in the digital humanities. What we need is a script to transform this verbose XML into a human-readable format like ASCIIDOC that maintains as many of the original features of the text as possible.
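
To make the third point concrete, here is a minimal sketch of how the words in an ALTO XML page might be flattened into plain text. It relies only on the ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The sample document is invented for illustration, the function names are hypothetical, and the output is bare text rather than ASCIIDOC.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix, since ALTO namespaces differ between schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return the page text, one output line per ALTO TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Keep only the word content; positions (HPOS, VPOS) and OCR
        # confidence (WC) are discarded in this simplified version.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Invented sample, not taken from a British Library file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Call" WC="0.98" HPOS="10" VPOS="20"/>
      <SP/>
      <String CONTENT="me" WC="0.95" HPOS="40" VPOS="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Ishmael." WC="0.61" HPOS="10" VPOS="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

A fuller transformation would also need to rejoin words hyphenated across lines and preserve page and paragraph breaks, which this sketch ignores.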

Approach / Methodology:

The project uses an IPython Notebook to conduct text transformations, assemble new prefatory documents with templates, interface with CLI git commands, and interface with the GitHub API. The notebook contains three main classes: one to extract metadata from a British Library ALTO XML text, one to initialize a local repository for that text and add README and CONTRIBUTING documents, and one to push the resulting local repository to GitHub. Portions of the code have been adapted from the GITenberg project, a similar initiative to transform and post Project Gutenberg books.
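
The second of these steps could look roughly like the sketch below, which fills a README template from a metadata dictionary and lists the git commands that would create the local repository. This is an illustration under stated assumptions, not the notebook's actual code: the template wording, the metadata keys (`title`, `author`, `date`), and the function names are all hypothetical.

```python
from string import Template

# Hypothetical README template; the real project's wording may differ.
README_TEMPLATE = Template(
    "$title\n"
    "$underline\n\n"
    "Author: $author\n"
    "Published: $date\n\n"
    "This repository contains the British Library's ALTO XML for this text.\n"
)


def render_readme(metadata):
    """Fill the README template, falling back to 'Unknown' for missing fields."""
    title = metadata["title"]
    return README_TEMPLATE.substitute(
        title=title,
        underline="=" * len(title),
        author=metadata.get("author", "Unknown"),
        date=metadata.get("date", "Unknown"),
    )


def init_commands(repo_dir):
    """Return the CLI git commands that turn repo_dir into a committed repository."""
    return [
        ["git", "init", repo_dir],
        ["git", "-C", repo_dir, "add", "--all"],
        ["git", "-C", repo_dir, "commit", "-m", "Initial import"],
    ]
```

The rendered README would be written into the text's directory before the commands are run, for example with `subprocess.run(command, check=True)` for each entry in the list.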

Extent of showcasing BL Digital Content:

At the moment, the project exclusively deals with British Library texts. Although only the four sample texts are currently showcased, the project hopes to use the full corpus of ALTO XML texts.

Impact of Project:

The project has only been active for the past month, so it has not had many opportunities for conference presentations or publications. However, it has received an honorable mention from the 2015 NYCDH Graduate Student Project Awards.

Issues / Challenges faced during project(s):

The project's main challenge is the lack of development hours available. As a full-time graduate student, I have very limited time with which to work on this project. To help mitigate this issue, I am soliciting help from other software developers, particularly those involved in the GITenberg project.