
Steven Gray

Submitted Entry for 2013 Competition

Abstract

Existing textual analysis platforms provide statistical analysis of single documents using either online resources or desktop applications. This analytical software gives the user a direct overview of the document's word count, word frequencies and various other metrics, but researchers must pull together results for multiple sources manually. Portable devices such as smartphones and tablets are commonly used for communication, but are increasingly used for data digestion, creation and now data discovery. Still, the problem remains that entire collections are locked away and are computationally difficult to process due to their size.

Textal, a prototype smartphone application created at UCL CASA (Centre for Advanced Spatial Analysis) and UCL Digital Humanities, has opened up textual analysis to the general public and researchers by providing tools that allow users to visualize text documents quickly and easily through word clouds and to export results for further analysis. Users can then dive deeper into the analysis of each word in the document by interacting with the word cloud through the device’s touch interface.

This proposal aims to create a new version of Textal that allows readers inside the British Library to pull together collections of text documents held by the Library, through an iPad application, and to visualise entire collections via an interactive, easy-to-use interface. These word clouds would be fully searchable, and the system would provide a view for browsing visualizations created by other users. Users of the application can then query different data sets and build custom queries to gain new insights into collections of documents.

This new system would be supported by the open source Big Data Toolkit (http://www.bigdatatoolkit.org), currently in development at UCL CASA, allowing the project to use distributed processing techniques to analyse complete collections.

URL
http://www.textal.org

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

We aim to research new methods of visualizing and mining large-scale data collections, which would not normally be possible using desktop computers or single servers. By utilising distributed computing, large-scale distributed databases (Hadoop or BigTable) and Graphics Processing Units (GPUs), we aim to create a workflow that will be capable of processing a collection of documents in real time. This new method would then be benchmarked for speed and efficiency and finally documented in my own PhD thesis, due to be completed in December 2014.

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

The project plans to use the Electronic Journals dataset as the base set of data for analysis. With over 8000 subscriptions of digitised articles available for data mining, this dataset will serve as the test data set to build a proof of concept system. In the future we hope to extend this system to include other digital datasets that the Library curates, for example the 19th century books. The system would allow readers to compare writing styles in separate journals or, for example, the change of subject matter through the years, and would also provide insight into the vast scale of the collection.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

We aim to carry out text analysis on complete collections and visualise the content of these collections through the iPad application. We will use a mixture of text mining, statistical pattern learning and clustering to identify documents within the collection that are related to each other, in order to provide the user with some pre-populated queries while using the application.
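As an illustration of how related documents might be identified, the following is a minimal sketch only, assuming TF-IDF term weighting and cosine similarity as the relatedness measure. The proposal does not commit to a specific technique, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `most_related`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF-IDF weighting."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Terms that occur in every document get weight log(1) = 0.
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_related(docs, index):
    """Return the index of the document most similar to docs[index].

    Assumes the collection holds at least two documents.
    """
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], v), i)
        for i, v in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]
```

Pairs of documents with a high similarity score could then seed the pre-populated queries offered to the reader. A production system would likely add stop-word removal and a proper clustering step on top of this.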

Firstly, we will identify a workflow using a single machine with a high-end GPU to analyse the collection concurrently, and run tests to see if we can serve real-time requests from the client using this hardware setup. The GPU equipment will be sourced through the NVIDIA Academic Partnership Program (https://research.nvidia.com/content/academic-partnership-program).

Secondly, we will investigate the parallelisation of text document processing in such a system, using distributed computing techniques found in large-scale data warehouse systems. We will apply to the Amazon Education Program (http://aws.amazon.com/grants/) to utilise their cloud computing platform dynamically.
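The parallelisation step can be pictured as a map/reduce word count, where each document is counted independently and the partial counts are then merged. The sketch below is illustrative only: a local thread pool stands in for the distributed workers, and the function names (`count_words`, `merge_counts`, `collection_word_counts`) are hypothetical rather than part of Textal or the Big Data Toolkit.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(text):
    """Map step: count word occurrences in a single document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def collection_word_counts(documents, workers=4):
    """Count words across a whole collection.

    Each document is counted independently (map), then the partial
    counts are merged into a single Counter (reduce). The thread pool
    is a stand-in for separate machines or processes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, documents))
    return reduce(merge_counts, partial_counts, Counter())
```

Because the map step never shares state between documents, the same structure could in principle be moved to a cluster framework, with the merged counts feeding the word cloud for the whole collection.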

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

The project software lead, Steven Gray, is an active Software Developer and Researcher at the UCL Centre for Advanced Spatial Analysis. His specialised research areas are Mobile Human Computer Interaction, large-scale data analysis and visualisation. He has built various mobile applications that are available to download on the App Store; these are detailed below.

Textal: https://itunes.apple.com/us/app/textal/id646764497

MapTube: https://itunes.apple.com/us/app/maptube/id648018647

ERSA2013: https://itunes.apple.com/us/app/esra-2013/id662551560

QRator: http://www.qrator.org

A full list of his papers can be found on Google Scholar.

http://scholar.google.co.uk/citations?user=9EEwOY8AAAAJ&hl=en

The project will seek the expertise of Dr. Melissa Terras, Director of UCL Digital Humanities, and Dr. Andrew Hudson-Smith, Director of UCL CASA.

Dr. Melissa Terras's research interests involve applying computational technologies to Humanities problems, to allow research that would otherwise be impossible. Her current projects include QRator, Textal, and Transcribe Bentham.

Dr. Andrew Hudson-Smith is Editor-in-Chief of Future Internet Journal, an elected Fellow of the Royal Society of Arts, and Course Founder and Director of the MRes in Advanced Spatial Analysis and Visualisation at University College London. With a focus on location-based digital technologies, he has been at the forefront of Web 2.0 technologies for communication, outreach and developing a unique contribution to knowledge.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Textal, the application on which this proposal is based, is an active project that is available on the App Store and is visualizing user-contributed textual data from Project Gutenberg books, web pages and social media conversations. This app serves as evidence for this proposal and proves that the system works on small-scale textual documents.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between July 6th - October 31st 2013

July 6 - Onwards
Activity described here (e.g. what, when and by who)

SG+BLLabs: Requirements capture with the BL Labs Team on Electronic Journal dataset. Feasibility study of providing iPads within the library.

BLLabs: Identify and create subset journal collections, which have mining rights.

August 2013
SG: Build in-house system for text analysis from existing Textal project
BL: Provide in-house hardware systems to support project
BL-Labs: Provide JSON objects of collections for indexing
BL-Labs: Provide JSON objects of collections for mining
BL-Labs: Provide end points to capture individual documents

September 2013
SG: Build Textal iPad application to connect to the backend systems and interface, using the current Textal project as the core code base.
BL-Labs: Provide feedback on application design

October 2013
BL-Labs + SG: Testing application and checking compliance with mining rights