Luda Zhao, Brian Do and Karen Wang

Submitted Entry for 2016 Competition

Abstract

Book illustrations, such as those in the British Library 1M collection, provide valuable insights into the cultural fabric of their time. However, large image collections are only useful and discoverable by researchers if they can be deeply explored via text. Annotating images with factual tags and descriptive, useful captions has traditionally required human annotators, an approach that cannot scale to millions of images. Here we propose to apply convolutional neural networks (CNNs), a state-of-the-art deep learning algorithm, to automatically tag and caption the entire British Library 1M collection. CNNs and deep learning have improved dramatically in the past decade and are now the tool of choice for generating knowledge from images and other multidimensional content.
We plan to develop and optimize CNN algorithms to accomplish two important tasks: tagging and captioning. In the first step, we will classify all images with general categorical tags (e.g. decorations, architecture, animals). This will serve as the basis for us to develop new ways to facilitate rapid online tagging with user-defined sets of tags, including a tag suggestions interface that would continuously reward user input by performing on-the-fly online learning to improve model accuracy. In the second step, we will build on recent results in deep learning research to automatically generate descriptive natural-language captions for all images (e.g. “A man in a meadow on a horse”).
We intend to make our tags and captions accessible and searchable by the public through a web-based interface. We also plan to provide supplementary visualization tools to give researchers greater insight into the technical assumptions behind our work. Finally, we will use our text annotations to globally analyze trends in the BL collection over time. Together, we hope our project will establish CNNs as a novel tool for image annotation and analysis, as well as encourage widespread adoption of neural networks in bibliology.

URL: http://cs231n.stanford.edu/reports2016/204_Report.pdf

Assessment Criteria

The research question you are trying to answer

Overall research question:
How can we apply convolutional neural networks (CNNs) to massively improve discoverability in the British Library 1M collection?

Classification:
1. Can we build an accurate model using CNNs that will be able to distinguish between decorations, people, maps, etc., based on supervised learning?

2. Since we will be using transfer learning from a pre-existing ImageNet model (Inception-v3), how well can a neural network trained on contemporary data perform when applied to a non-standard task, such as classification of illustration images?

3. Can we develop a computational representation of images so that we can quickly find images that match any tag, not just the 12 tags we started with?

4. Can we build flexible models for doing on-the-fly online learning to improve tagging suggestions?

Captioning:
1. How can we pair a recurrent neural network (RNN), used primarily in language understanding and translation, and a CNN, used in computer vision, to both understand the content of an image and then generate relevant, grammatically sound captions for that image?

2. How can we use contextual text, such as OCR of the surrounding pages, to improve the quality of the generated captions?

3. What is an effective evaluation metric to measure the accuracy of the generated captions?

Presentation of Results:
1. How should we surface our tags and captions in a way that is easily usable by other researchers?

Please explain the extent and way your idea will showcase British Library digital content

Our project will directly improve the discoverability and searchability of the British Library Flickr Image Collection. The dataset is a vast collection of valuable book illustrations from the 15th century to the 19th century, but outside of searching individually contributed Flickr tags (which cover less than 10% of the dataset), there are few direct ways to discover individual images based on their content. E. Crowley et al. have already made progress in this domain with an engine that trains classifiers on-the-fly using Google Images to find images containing a specific object or pattern. Building on this work, we hope that our tags and natural-language captions will enable researchers and communities to more quickly and intuitively search the dataset for specific images in order to investigate their topics of interest--whether that is compiling map collections, extracting ideas for artistic renditions, or searching for evidence of historical trends. We also hope to improve the tagging process for researchers by using our CNN model to provide useful tag suggestion mechanisms that are automatically customized to each researcher’s specific interests.
Additionally, we hope that this project could be a proof of concept that demonstrates the potential of machine learning to revolutionize digital historical research. Artificial neural networks are increasingly being applied to diverse domains, and we hope this project can inspire other researchers to apply similar techniques in their own work.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods

Recent advances in neural network research and computational resources will make our project feasible. We will use TensorFlow, a contemporary open-source deep learning framework, to construct our algorithms, and we will be using graphics processing unit (GPU)-enabled machines on Amazon Web Services to perform all computational tasks.
For classification, we will first label 1,500 images with one of 12 categories (animals, architecture, decorations, landscapes, nature, people, miniatures, text, seals, objects, diagrams, and maps). We determined these categories through personal communication with Adrian Edwards, a curator at the British Library, whom we consulted to determine the most useful general-purpose tags for the collection. Each category will have at least 100 images. We will feed these images into a pretrained Inception-v3 CNN running in TensorFlow. Inception-v3 was trained by the Google deep learning team and achieves over 96% top-5 accuracy on the 1000-category ImageNet classification task. Using this 1.5K-image model, we will then classify 10,000 additional randomly chosen images into one of the above 12 categories. We will verify these tags manually and correct them as needed. We will then use these newly labeled images to train a final CNN, which will be used to provide tags for the entire dataset.
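To make the transfer-learning step concrete, below is a minimal sketch of how a new 12-way classification head could be fine-tuned on top of a pretrained Inception-v3 in TensorFlow, using its bundled Keras API. The directory layout, image size, and training hyperparameters are illustrative assumptions rather than final choices.

```python
# Minimal transfer-learning sketch: a 12-way classification head on a pretrained
# Inception-v3. Paths and hyperparameters are illustrative, not final choices.
import tensorflow as tf

NUM_CLASSES = 12  # animals, architecture, decorations, landscapes, ...

# Load Inception-v3 pretrained on ImageNet, without its original 1000-way classifier.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(299, 299, 3))
base.trainable = False  # freeze the pretrained convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Assumes the 1,500 labeled images are arranged as labeled_images/<category>/<file>.jpg
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "labeled_images", image_size=(299, 299), batch_size=32)
train_ds = train_ds.map(
    lambda x, y: (tf.keras.applications.inception_v3.preprocess_input(x), y))

model.fit(train_ds, epochs=5)
```

Freezing the pretrained layers keeps training fast on a small labeled set; once the larger 10,000-image set is verified, some of the upper Inception blocks could be unfrozen for further fine-tuning.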

Since our 12 categories are inherently limiting, we will also explore ways to develop new computational representations of images that can enable flexible searching. For instance, the above CNN represents each image as a fixed-length vector of numbers (a “code”), then learns how to read this code to determine which of the 12 tags best represents the image. We intend to explore modifications and extensions to this “code” to allow tagging beyond these 12 tags.
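As an illustration of how such a code could support searching beyond the fixed tag set, the sketch below extracts the penultimate-layer feature vector for each image and ranks the collection by cosine similarity to a few seed images chosen for an arbitrary, user-defined tag. The function names and the cosine-similarity approach are our own illustrative assumptions.

```python
# Sketch: use each image's penultimate-layer feature vector (its "code") for
# flexible, tag-agnostic similarity search. Names and approach are illustrative.
import numpy as np
import tensorflow as tf

# Global average pooling over the last convolutional features yields one
# fixed-length vector per image (the exact length depends on the layer used).
encoder = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")

def extract_code(image_path):
    """Return the fixed-length feature vector ("code") for one image."""
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[None, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return encoder.predict(x)[0]

def find_similar(seed_codes, all_codes, top_k=20):
    """Rank the collection by cosine similarity to the mean of a few seed codes."""
    query = np.mean(seed_codes, axis=0)
    norms = np.linalg.norm(all_codes, axis=1) * np.linalg.norm(query)
    scores = all_codes @ query / np.maximum(norms, 1e-8)
    return np.argsort(-scores)[:top_k]
```

Once these codes are precomputed for the whole collection, the same vectors can also feed the tag suggestion tool described next.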
We also intend to explore ways to provide tagging suggestions that are based on our CNN model. We plan to implement an on-the-fly training model that would steadily take small training steps using custom user-inputted tags to improve the accuracy of further suggestions. For example, for the tag “bicycle”, the tool would surface sets of pictures that it believes are likely to contain bicycles based on a small seed dataset, and every input it receives thereafter would be added to the training data, further improving the accuracy of the suggestions.
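One possible way to realize this loop is to train a lightweight classifier over the precomputed image codes and update it incrementally with each round of user feedback. The sketch below uses scikit-learn's SGDClassifier with partial_fit for the online-learning steps; the class and method names are illustrative and not part of our existing codebase.

```python
# Sketch of the on-the-fly tag-suggestion loop over precomputed image codes.
# SGDClassifier/partial_fit is just one convenient way to take small online
# training steps as user feedback arrives.
import numpy as np
from sklearn.linear_model import SGDClassifier

class TagSuggester:
    """Incrementally learns a binary 'has this tag / does not' classifier."""

    def __init__(self, seed_codes, seed_labels):
        # seed_codes: (n, d) array of image codes; seed_labels: 0/1 array
        self.clf = SGDClassifier()
        self.clf.partial_fit(seed_codes, seed_labels, classes=np.array([0, 1]))

    def suggest(self, all_codes, top_k=20):
        """Return indices of the images the model currently believes carry the tag."""
        scores = self.clf.decision_function(all_codes)
        return np.argsort(-scores)[:top_k]

    def feedback(self, codes, labels):
        """Fold a new batch of user confirmations/rejections into the model."""
        self.clf.partial_fit(codes, labels)
```

Because each feedback step only updates a small linear model rather than the full CNN, suggestions can be refreshed interactively while the user is still tagging.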

For the image-captioning part of the project, we will combine the existing CNN structure with a recurrent neural network (RNN). RNNs are a class of neural networks well suited to generating ordered output, such as English sentences. First, we will build a training set of captions through various approaches, including direct manual labeling, volunteer generation of captions, Amazon Mechanical Turk, OCR (optical character recognition) of text from the surrounding pages of the book, or direct OCR of text captions in the image. We will experiment to determine the exact features we want to use as the basis of our captions. The ground-truth phrases will be fed into word2vec to extract useful feature vectors, which, in combination with the high-dimensional features from the CNN, will allow us to generate coherent and relevant captions. Once we have tuned our model on the training set to a sufficiently high accuracy, we will use the CNN-RNN hybrid network to generate captions for images with no captions. The captions in which the model is most confident will be surfaced through the online interface.
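To ground this plan, below is a compact sketch of the kind of CNN-to-RNN captioning model we have in mind, in the spirit of the recent CNN+RNN captioning work referenced in the Technical section below. The vocabulary size, embedding dimension, and the choice to condition the LSTM by using the projected image code as its initial state are illustrative assumptions; the sketch also learns its word embeddings directly rather than loading word2vec vectors, purely to keep the example self-contained.

```python
# Compact sketch of a CNN -> RNN captioning model. Sizes and wiring are
# illustrative assumptions, not our final architecture.
import tensorflow as tf

VOCAB_SIZE = 10000   # placeholder vocabulary built from the ground-truth captions
MAX_LEN = 20         # maximum caption length in tokens
CODE_DIM = 2048      # length of the precomputed image "code" from the CNN encoder

# Image branch: project the precomputed CNN code into the RNN's state space.
image_code = tf.keras.Input(shape=(CODE_DIM,))
img_state = tf.keras.layers.Dense(256, activation="relu")(image_code)

# Text branch: the caption-so-far, as a padded sequence of word indices.
caption_in = tf.keras.Input(shape=(MAX_LEN,))
word_embed = tf.keras.layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_in)

# Condition the LSTM on the image by using the projected code as its initial state,
# then predict a distribution over the next word.
lstm_out = tf.keras.layers.LSTM(256)(word_embed, initial_state=[img_state, img_state])
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(lstm_out)

captioner = tf.keras.Model([image_code, caption_in], next_word)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

In training, each ground-truth caption would be expanded into (image code, caption prefix) -> next-word pairs; at inference time the model is unrolled one word at a time, and the per-word probabilities give a natural confidence score for deciding which generated captions to surface.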

Please provide evidence how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the British Library

We are senior undergraduate/graduate students in the Computer Science department at Stanford, with prior experience in machine learning, natural language processing, and computer vision. This project is a continuation of a neural network class project that we have been working on for the past three months, and we have already made significant progress in the work that we have outlined in this proposal, especially the tagging portion of the project.

Here is a link to our paper, which details our completed work: http://cs231n.stanford.edu/reports2016/204_Report.pdf

We’ve also presented our completed work at a poster session at Stanford and received technical feedback that we will incorporate into the work ahead. Additionally, we have established connections with Stanford faculty members and members of the British Library Labs team, whom we can reach out to for advice on both technical and non-technical challenges of the project.

Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis

Technical
As mentioned previously, we have made progress on building neural networks to classify images in the collection, and we plan to leverage the same infrastructure to complete the other parts of the project. Automatic image-to-caption work has been done with an RNN+CNN architecture by research labs at Stanford in the past year (as detailed here: http://cs.stanford.edu/people/karpathy/sfmltalk.pdf), and we will be following their methods closely to implement our image captioning model. With monetary support from the British Library, we will be able to leverage Amazon’s powerful cloud-computing resources, which are the standard for cutting-edge neural network research.

Curatorial
All of the digital collections required for this project are accessible on site at the British Library. We will be working closely with the BL Labs team to obtain any extra datasets that we need, including OCR data of the scanned books as well as human-annotated captions for the data. We will also be using the Flickr API to obtain additional existing tags and metadata as needed.

Legal
The existing code is already open source, and any new code will be released in the same way. We will license our code under the MIT open-source license.

Please provide a brief plan of how you will implement your idea by working with the Labs team


June 2016
- Consolidation of prior work regarding image tagging

- Start collection of ground-truth captions via direct captioning or Amazon Mechanical Turk

- Initial work on infrastructure for the combined RNN+CNN model

- Literature review

July 2016
- Continue collection of ground-truth data for image captioning

- Build and train first end-to-end model on AWS

August 2016
- Finish collection of ground-truth dataset for image captioning

- Start incorporation of OCR data into feature set for image tagging and captioning

- Training, cross-validation, and initial evaluation of captioning model

- Sketch wireframes for the web interface and tag suggestions mechanism

September 2016
- Iteratively improve model architecture and training procedure

- Continue incorporation of OCR data into feature set for image tagging and captioning

- Start development on web interface and tag suggestions tool

October 2016
- Use best-performing model to complete automatic tagging + captioning of entire dataset

- Finish integration of web interface with tag suggestions and captions

- Work with the BL Labs technical team to integrate tags + captions

- Work with the Flickr team to integrate tags + captions