Samtla (Search And Mining Tools with Linguistic Analysis): A research environment for domain-specific corpora in the Digital Humanities. (Category: Research)

Name of Submitter(s): Prof Mark Levene, Dr Dell Zhang, Dr Dan Levene, Mr Martyn Harris

Organisation: Birkbeck, University of London; University of Southampton.

Samtla (Search And Mining Tools with Linguistic Analysis) is a web-based, domain-specific research environment for the digital humanities, developed in response to the lack of tools for in-depth research of historic texts and for opening up archives to public access.
Samtla is a language-independent framework featuring approximate phrase search and document comparison, functionalities not provided by state-of-the-art general-purpose web search engines such as Google and Bing. The retrieval algorithm of Samtla uses a character-based n-gram language model, which gives greater flexibility in query processing and allows the system to be corpus-agnostic and data-driven, enabling fast deployment to new user groups.
Samtla includes text mining tools that help researchers discover historic documents: keyword and phrase search, corpus browsing that leverages document metadata and named-entity data to generate a directory or treemap view, and a recommendation system that suggests related queries as well as queries and documents that are popular or of interest to the research community.

For an overview, presentation, and publication on Samtla, please visit www.samtla.com.
URL for Entry: www.samtla.com

Email: mark@dcs.bbk.ac.uk, dell@dcs.bbk.ac.uk, d.levene@soton.ac.uk, martyn@dcs.bbk.ac.uk

Twitter:

Job Title: Professor of Computer Science and Assistant Dean (Head of Department), Senior Lecturer in Computer Science and MSc IT Programme Director, Reader in History, PhD Researcher.

Background of Submitter:

Our team is composed of computer scientists, historians, and linguists who have been working in collaboration to provide a universal platform for mining and analysing text corpora that are currently inaccessible due to a lack of tools. We have successfully developed search, browsing, and text mining tools over a number of historic collections, including: Aramaic Magic Bowls from Late Antiquity (University of Southampton), Giorgio Vasari (Vasari Centre, School of Arts, Birkbeck, University of London), and the King James Bible (for demonstration purposes). We work in collaboration with researchers to develop tools that complement existing research methods, whilst helping researchers to access and analyse original sources faster than with traditional methods.
Conferences:
2014: The anatomy of a search and mining system for the Digital Humanities. IEEE/ACM Joint Conference on Digital Libraries (JCDL).
2013: Samtla: Search And Mining Tools with Linguistic Analysis – Domain Specific Search through Language Modeling. Part of the Digital Humanities seminar series, Southampton University. Video link: http://bit.ly/samtla_video
2012: Samtla: Search And Mining Tools with Linguistic Analysis. Digital Humanities Conference, Sheffield University.
Papers:
Harris, M., Levene, M., Zhang, D., and Levene, D. The anatomy of a search and mining system for digital humanities. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 165–168, 8–12 Sept. 2014. doi: 10.1109/JCDL.2014.6970163

Problem / Challenge Space:

The Samtla system addresses the need for intelligent document search and text mining tools in the digital humanities, where few currently exist. Those tools that are available are typically tied to a single corpus (e.g., the works of Shakespeare), restricted to a set of well-known or popular languages (e.g., English and French), or offer quite limited functionality.
The Samtla system uses character-based statistical language models to provide a computational framework that is both language-independent and flexible in terms of querying. We hope to collaborate with the British Library Labs to further improve the Samtla system’s performance, scalability, and usability, so that it can handle much larger datasets and support many more researchers.

Approach / Methodology:

Search:
A statistical language model (SLM) is a mathematical model of the probability distribution of words, or in our case sequences of characters, in the natural language of a text corpus. Since Samtla is designed as a language-agnostic search tool, it adopts a character-based n-gram SLM rather than the more conventional word-based SLM, implemented efficiently with an optimised suffix tree data structure. This novel approach provides Samtla with a consistent methodology for retrieving and ranking search results according to the underlying structure of the language present in a corpus of documents, which is often domain specific.
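As a minimal illustration (a Python sketch, not Samtla's actual implementation), the following scores a query against per-document character trigram models, with simple add-one smoothing standing in for Samtla's estimator:

```python
import math
from collections import defaultdict

def char_ngram_counts(text, n):
    """Count all character n-grams and their (n-1)-character prefixes."""
    ngrams, prefixes = defaultdict(int), defaultdict(int)
    padded = " " * (n - 1) + text          # pad so every character has a history
    for i in range(len(text)):
        gram = padded[i:i + n]
        ngrams[gram] += 1
        prefixes[gram[:-1]] += 1
    return ngrams, prefixes

def query_log_prob(query, ngrams, prefixes, n, alphabet_size=256):
    """Log-probability of the query under a document's character model,
    with add-one smoothing so unseen n-grams keep a small probability."""
    padded = " " * (n - 1) + query
    logp = 0.0
    for i in range(len(query)):
        gram = padded[i:i + n]
        logp += math.log((ngrams.get(gram, 0) + 1) /
                         (prefixes.get(gram[:-1], 0) + alphabet_size))
    return logp

# Rank documents by how likely their model is to have generated the query.
docs = {"d1": "in the beginning god created the heaven and the earth",
        "d2": "the quick brown fox jumps over the lazy dog"}
models = {d: char_ngram_counts(t, n=3) for d, t in docs.items()}
ranked = sorted(docs, key=lambda d: -query_log_prob("the beginning", *models[d], n=3))
print(ranked)   # 'd1' should outrank 'd2' for this query
```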
A character-based n-gram SLM enables the system to be applied to multilingual corpora with very little pre-processing of the documents, unlike word-based systems. For example, Semitic languages like Aramaic and Hebrew attach affixes to a root word to mark syntactic relationships. This complicates word-based retrieval models, since it is necessary to capture all instances of the same word in order to produce an accurate probabilistic model. While word-based models typically require language-dependent stemming, part-of-speech tagging, and text segmentation algorithms, a character-based model has been shown to deal with these issues much more easily and even achieve better retrieval performance. Furthermore, a character-based n-gram SLM enables the system to be applied not only to homogeneous corpora, each in a different language, but also to heterogeneous corpora that contain documents written in mixed languages and scripts. For example, some documents in the Aramaic collection contain texts written in Hebrew, Judeo-Arabic, Syriac, Mandaic, and Aramaic, whereas the Vasari corpus contains English and Italian documents. The British Library Microsoft corpus contains nine different languages, including English, French, Spanish, German, Hungarian, and Russian.
Moreover, the SLM built on a suffix tree provides a method for recommending queries to users based on the statistics of the language corpus, allowing them to refine their query or find alternative search terms, such as orthographic or morphological variants and OCR errors.
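For illustration, a simplified recommendation scheme might rank candidate terms by character n-gram overlap with the query, which surfaces exactly the kind of orthographic variants and OCR-damaged spellings described above (the real system walks the suffix tree directly; this sketch and its vocabulary are hypothetical):

```python
def ngram_profile(term, n=3):
    """The set of character n-grams in a term."""
    return {term[i:i + n] for i in range(max(len(term) - n + 1, 1))}

def suggest(query, vocabulary, n=3, k=5):
    """Rank candidate terms by character n-gram overlap (Jaccard) with the
    query, surfacing orthographic/morphological variants and OCR-damaged
    spellings as alternative search terms."""
    q = ngram_profile(query.lower(), n)
    def similarity(term):
        t = ngram_profile(term.lower(), n)
        return len(q & t) / len(q | t)
    return sorted(vocabulary, key=similarity, reverse=True)[:k]

# Hypothetical vocabulary drawn from the corpus statistics:
vocab = ["beginning", "beginnings", "begynnynge", "heaven", "hevven", "earth"]
print(suggest("beginning", vocab, k=3))   # variants of 'beginning' rank first
```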
Finally, a metadata filter is provided as part of the search results to help users narrow down their search, for instance to books referencing a particular geographical area.
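A minimal sketch of such a filter, assuming a hypothetical record layout (Samtla's metadata schema will differ):

```python
def filter_results(results, field, value):
    """Narrow a ranked result list to documents whose metadata matches the
    given field value, e.g. books referencing a particular place."""
    return [r for r in results if value in r.get("metadata", {}).get(field, [])]

results = [{"title": "Book A", "metadata": {"place": ["London", "Paris"]}},
           {"title": "Book B", "metadata": {"place": ["Rome"]}}]
print(filter_results(results, "place", "London"))   # only Book A remains
```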
Browsing:
The browsing tool allows users to explore the corpus using a traditional file-directory list view or a treemap. It recursively partitions the metadata by field and presents only the top-level categories, which prevents the user from being overwhelmed by too much information in one go. As an example, a user interested in books published in 1810 can simply navigate from a top-level publication-date category to the 1800s, and then to the 1810s, where they will find a list of books published in that particular decade.
The novelty of this tool is that the treemap cells can be scaled or coloured to represent different features of the sub-collections, or even a summary of the content.
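The recursive partitioning behind the directory and treemap views can be sketched as follows (a simplified Python illustration with hypothetical records, not Samtla's code):

```python
from collections import defaultdict

def partition(records, fields):
    """Recursively partition metadata records by the given fields, producing
    the nested categories behind the directory and treemap views."""
    if not fields:
        return records                      # leaf level: the documents themselves
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(fields[0], "unknown")].append(rec)
    return {key: partition(group, fields[1:]) for key, group in groups.items()}

# Hypothetical records with dates pre-bucketed into century/decade/year:
records = [{"century": "1800s", "decade": "1810s", "year": "1810", "title": "Book A"},
           {"century": "1800s", "decade": "1810s", "year": "1812", "title": "Book B"},
           {"century": "1800s", "decade": "1820s", "year": "1820", "title": "Book C"}]
tree = partition(records, ["century", "decade", "year"])
print(sorted(tree["1800s"]["1810s"]))       # ['1810', '1812']
```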
Document view:

Once a user has located a document, they can view it alongside its metadata record, including the book's British Library shelfmark, notes, publisher, resource type, and so on. Simple navigation is provided to enable users to page through the book. The user can also navigate to third-party sources of information where available.
User and Community:
Through analysing the usage log data, Samtla can inform users of the top-10 most popular queries and documents in the research community, so as to support collaborative search. A user of Samtla can thus be directed to "interesting" aspects of the corpus being studied that may not have occurred to them previously.
The popular queries and documents are ranked and selected using an algorithm similar to the Adaptive Replacement Cache (ARC), where the frequency of each query or document is combined with its recency (measured by the number of days since the query was last submitted or the document last viewed). This biases the recommended popular queries and documents towards fresh ones and keeps them updated over time, preventing items with high counts but long gaps between submissions from dominating the top entries.
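The exact combination formula is not given here, but a sketch of such a frequency-plus-recency score might use an exponential decay over days since last activity (the half-life below is an assumed parameter, not Samtla's):

```python
import math, time

DAY = 86400.0
HALF_LIFE_DAYS = 7.0    # assumed decay rate; not specified by the source

def popularity(frequency, last_seen, now=None):
    """Combine how often a query/document occurs with how recently it was
    last seen, so stale-but-frequent items cannot dominate the top-10.
    (A hypothetical exponential decay stands in for Samtla's formula.)"""
    now = time.time() if now is None else now
    days_since = (now - last_seen) / DAY
    return frequency * math.exp(-math.log(2) * days_since / HALF_LIFE_DAYS)

now = time.time()
print(popularity(100, now - 30 * DAY, now))   # ~5.1: frequent but stale
print(popularity(10, now, now))               # 10.0: fresh item wins
```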
The ranked results are made accessible via the respective side-bars in the user interface, which are populated when the user navigates to a document through browsing or searching.

Extent of showcasing BL Digital Content:

The Samtla system has indexed a version of the British Library’s Microsoft corpus of scanned books and opened it to researchers. With Samtla, one can easily search, browse, explore, and view this huge collection of multilingual books, from the 15th century to the modern day, in a number of novel ways using the rich metadata as well as the archive of scanned book images.

Impact of Project:

In addition to the Microsoft corpus, Samtla currently operates with the King James Bible (in English), Aramaic Magic Bowls from Late Antiquity (Aramaic, Mandaic, and Syriac texts), and Giorgio Vasari’s “Lives of the most Excellent Artists and Architects” (in both English and Italian translations). Furthermore, as a result of our work with the Microsoft book corpus, we have been successful in obtaining College funding to develop another version of Samtla as part of a pilot project in collaboration with the British Library News and Media team.
The system has been successfully demonstrated at information retrieval and digital humanities conferences and workshops. It has led to a publication at the prestigious JCDL conference, and a journal paper is in preparation.

Issues / Challenges faced during project(s):

One of the key issues that required addressing was scalability. When faced with a large document collection, even the few milliseconds required for reading from and writing to a database can add up to a lag of several seconds, which negatively affects users’ perception of the system’s utility.
In addition, the physical size of the expanded corpus, including the data generated by Samtla (suffix tree, language model, treemap), runs to several terabytes. We adopted a number of techniques to address these issues:
1.) Pre-processing:
One of the simplest methods for improving scalability was to partition the dataset into manageable chunks in order to keep the system as responsive as possible. For example, the browsing tool was scaled by partitioning the metadata into further categories: dates were partitioned by century, then decade, then year (sketched below), which reduces the time needed to generate the page and keeps the number of results at a manageable size for the researcher to explore. In some instances we used DBpedia and other Wikipedia-derived sources to obtain top-level categories; for instance, we queried DBpedia to find the continent or country on which to cluster city names.
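The date partitioning itself reduces to a simple bucketing of each publication year into its century and decade categories, for example:

```python
def date_buckets(year):
    """Derive the century/decade/year categories used to partition
    publication dates into manageable browsing chunks."""
    return f"{year // 100 * 100}s", f"{year // 10 * 10}s", str(year)

print(date_buckets(1810))   # ('1800s', '1810s', '1810')
```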
The search tool adopts a lazy-loading approach to displaying search results: when the user scrolls past a pre-defined limit in the result list, the system asynchronously loads additional results.
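A minimal server-side sketch of this pagination, using Flask purely as an illustrative stand-in for Samtla's actual web stack:

```python
from flask import Flask, jsonify, request   # Flask is an assumption, not Samtla's stack

app = Flask(__name__)
RESULTS = [f"document-{i}" for i in range(10_000)]   # placeholder ranked results
PAGE_SIZE = 20

@app.route("/search")
def search_page():
    """Serve one page of results; the client asks for the next offset
    asynchronously when the user scrolls past the current page."""
    offset = int(request.args.get("offset", 0))
    page = RESULTS[offset:offset + PAGE_SIZE]
    return jsonify(results=page, next_offset=offset + len(page))
```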
The document view renders the document text by unzipping the XML file on the fly. This has enabled us to keep the storage requirements as low as possible, since the fully expanded version is many times larger than the compressed version.
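A sketch of this on-demand decompression, with gzip standing in for whatever compression scheme is actually used:

```python
import gzip

def load_document(path):
    """Decompress a stored document on demand, so only the compressed form
    (many times smaller) ever sits on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()   # the XML text, expanded only in memory
```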
2.) Metadata Search:
We have found that it is often sufficient to use the metadata to search and locate books in a large document collection such as the Microsoft corpus. Metadata search is a recent addition to the Samtla framework in response to these scalability challenges, and we are currently investigating further methods to support full-book search.
3.) Hardware:
Solid State Drives (SSDs) provide fast read and write speeds compared to traditional hard drives because they use flash memory and consequently have no moving parts. We have installed a number of them on our servers to store the index data for time-critical tasks, i.e., the probabilistic ranking of search results using the character-based SLM.
The traditional generalised suffix tree structure can consume a lot of random access memory during construction, so we have adopted a truncated suffix tree, which limits the depth of the tree and enables us to load it into main memory (see the sketch below). We are also investigating algorithms for disk-based construction of suffix trees on SSDs.
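A toy depth-limited structure illustrates the idea: indexing every suffix only to a fixed depth bounds memory while still answering all n-gram counts up to that depth (a plain dict-based trie, not Samtla's optimised representation):

```python
class TruncatedSuffixTree:
    """A depth-limited suffix structure: indexing every suffix only to depth
    `max_depth` bounds memory while still answering all n-gram counts up to
    that length."""

    def __init__(self, max_depth=5):
        self.max_depth = max_depth
        self.root = {}

    def add_text(self, text):
        for i in range(len(text)):
            kids = self.root
            # Insert the suffix starting at i, truncated to max_depth characters.
            for ch in text[i:i + self.max_depth]:
                node = kids.setdefault(ch, {"count": 0, "kids": {}})
                node["count"] += 1
                kids = node["kids"]

    def count(self, ngram):
        """Occurrences of `ngram` (len(ngram) <= max_depth) in the corpus."""
        node, kids = None, self.root
        for ch in ngram:
            if ch not in kids:
                return 0
            node = kids[ch]
            kids = node["kids"]
        return node["count"] if node else 0

tree = TruncatedSuffixTree(max_depth=4)
tree.add_text("banana")
print(tree.count("ana"))   # 2: 'ana' occurs twice in 'banana'
```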
Another issue we encountered was the low accuracy of OCR, which can cause poor retrieval performance in terms of precision, especially for word-based search models. Our character-based approach addresses this issue by allowing the system to capture as much of the query as possible, and it performs well for long, verbose queries such as quotations, set phrases, and formulaic expressions (like those found in religious texts).