
Emanuil Tolev

Submitted Entry for 2013 Competition

Abstract

Seriousnews is a service which enables the investigation of the scholarly sources behind news articles which do not cite their sources. The idea is to:

1/ enable users of news sites to find scholarly evidence for claims made in news articles

2/ promote the correct referencing of sources of knowledge among the authors of news articles, using bibliographic data held by the British Library.

The service lives at the British Library but also includes a web browser widget, to allow users to analyse news articles more easily.

For example, the BBC publishes an article titled “Knobbly reptile roamed vast ancient desert” [1]. The journal which published the palaeontological findings is mentioned by name, but there is no link or any other form of reference provided to the scholarly paper which actually describes the findings. Seriousnews’ browser widget would take the text of the news article and send it to the seriousnews service. The service would perform term extraction, looking for words and phrases which would be most helpful in identifying the source of the knowledge in the news article. It would then search the British National Bibliography published by the British Library for scholarly publications which have the extracted key terms in the title or other metadata (e.g. keywords). Other bibliographic collections or metadata could be used as a data source too (for example, all the metadata from scholarly journals which the British Library has text mining rights for - a few thousand journals, apparently).

The browser widget would also react to a specific part of the text of a news article being highlighted. When this is the case, it would search the bibliographic data sources using the text selected by the user instead of trying to perform smart term extraction.
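
To make the intended flow concrete, below is a minimal sketch of the service-side logic in Python against an elasticsearch index. Everything in it is illustrative: the index name ("bnb"), the metadata field names and the extract_terms stub are assumptions, not the actual National Bibliography schema or the eventual term extraction component.

```python
# Sketch only: hypothetical seriousnews lookup flow.
# Assumes National Bibliography metadata has already been indexed into
# elasticsearch under a placeholder index name, "bnb".
from elasticsearch import Elasticsearch


def extract_terms(article_text):
    """Stub for the term extraction step; a naive baseline is sketched
    in the term extraction discussion later in this application."""
    raise NotImplementedError


def investigate(article_text, highlighted=None, es=None):
    """Return candidate scholarly records for a news article.

    If the reader has highlighted a passage, search with that text
    directly; otherwise fall back to automatic term extraction.
    """
    es = es or Elasticsearch()
    search_text = highlighted or " ".join(extract_terms(article_text))
    query = {
        "query": {
            "multi_match": {
                "query": search_text,
                # Field names are assumptions about the indexed metadata.
                "fields": ["title", "keywords", "subjects"],
            }
        }
    }
    result = es.search(index="bnb", body=query)
    return [hit["_source"] for hit in result["hits"]["hits"]]
```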

Care will be taken to allow more permanent storage of the results of the service, such as allowing users to create simple profiles where they can add the results of an “investigation” performed by seriousnews on a news article to a simple personal reading list. An even simpler method of persisting the results is sending them to the user’s e-mail address.

Anonymised usage data generated by the widget and the service can be used to identify situations, contexts and patterns where the public wishes to engage more closely with a piece of scholarly output. This will shed more light on when news articles generate more views or impact for the scholarly outputs they use as a source.

The name of the service resulting from this idea is subject to change following consultation with the British Library Labs team.

[1] British Broadcasting Corporation, "Knobbly reptile roamed vast ancient desert", June 2013, http://www.bbc.co.uk/news/science-environment-23032892

Assessment Criteria

The research question / problem you are trying to answer

Please focus on the clarity and quality of the research question / problem posed:

Understanding the role of everyday news publications as a public-scholarship engagement tool and impact driver for scholarly output, through a practical application

Please explain the ways your idea will showcase British Library digital collections

Please ensure you include details of British Library digital collections you are showcasing (you may use several collections if you wish), a sample can be found at http://labs.bl.uk/Digital+Collections

Seriousnews is very much about connecting media reports with scholarly literature. However, the following are vital to the idea:

1/ the British Library’s National Bibliography, available in machine-readable formats

2/ other potential data sources in the form of journal metadata that seriousnews has the legal right to use while living at the British Library

3/ other data sources identified in cooperation with the British Library Labs team after the start of the project

It is impossible to provide such a service without the data that describes scholarly publications. Even with term extraction or other analysis of news items, there would be no way to know what scholarly publications are available so that they can be linked up to news items.

Therefore, the project will:

1/ Credit the British Library in an obvious way on the seriousnews web browser widget, as well as on the seriousnews service. One way would be to synthesise the content of this application section into a clear, very brief web page on the seriousnews service, pointing interested users and technical developers to the British Library Labs and National Bibliography pages. If done properly, this could increase exploitation of the rich resource that is the National Bibliography, as well as lead targeted traffic to the British Library Labs.

2/ Publish its anonymised usage statistics under an open license. The link between media and scholarly output is of interest to researchers, publishers and libraries, as well as (to some extent) technical developers and the general public. The audience for such data would usually be interested in what other resources the British Library has to offer to innovation in academia and business. The outputs of the previous point would be used here to convert interest in seriousnews’ usage data into a more general introduction to the British Library’s digital resources.

Please detail the approach(es) / method(s) you are going to use to implement your idea, detailing clearly the research methods / techniques / processes involved

Indicate and describe any research methods / processes / techniques and approaches you are going to use, e.g. text mining, visualisations, statistical analysis etc.

Visualisation of usage data from the seriousnews service

Topical visualisation of news text data analysed by the seriousnews service (essentially showing what scholarly topics the media is interested in reporting)

Text mining and text analysis, term extraction of news text data

Please provide evidence of how you / your team have the skills, knowledge and expertise to successfully carry out the project by working with the Labs team

E.g. work you may have done, publications, a list with dates and links (if you have them)

The author has participated in a number of Open Knowledge and Open Higher Education projects in the past few years. All of this work has been undertaken as part of the Cottage Labs partnership of software developers, which is dedicated to Open Knowledge and Open Source. This is a team which can be relied on for pieces of this project as well, even though the author will be the main developer working on it.

There are three particularly difficult aspects of this project:

1/ Searching across all relevant fields of the British National Bibliography.

The author (and the wider team) have experience in large-scale data processing, in particular indexing and searching metadata for academic publications, educational resources and funding data. There are several recent projects which have required such services:

a/ The G4HE project focusses on building tools which provide useful insight into the data gathered by the BIS-funded RCUK Gateway to Research (GtR) project - essentially, finding and implementing uses for the majority of national scholarly funding data in the UK, in particular uses which Higher Education Institutions are interested in. This is a good example of where we’ve had to explore a large dataset for a particular application. We had to retrieve all of the data of the Gateway to Research project so that we could run aggregate analysis on it. We did this by indexing the data using a piece of software called elasticsearch, and could then run very quick searches across the whole dataset. Some more information is available here:

http://cottagelabs.com/projects/g4he

b/ We have processed the MEDLINE medical publications metadata dataset of about 20 million publication records using the same elasticsearch software and had a text search query on all metadata fields return within 5 seconds. The data has been used in multiple projects, usually to pinpoint potential for innovative links between different sets of data in order to generate useful insight into a field, or to prove the value of Open Data and Open Scholarship.
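
As an illustration of the approach (not of the actual MEDLINE or National Bibliography schemas), indexing a few records and searching across all of their fields with the elasticsearch Python client might look like the sketch below; the index name and record fields are invented.

```python
# Sketch: indexing metadata records and searching across all fields.
# The index name and record fields are invented for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a locally running elasticsearch node

records = [
    {"title": "An example paper about ancient reptiles", "journal": "Example Journal A"},
    {"title": "Another example paper about desert geology", "journal": "Example Journal B"},
]

for i, record in enumerate(records):
    es.index(index="bibliographic-metadata", id=i, body=record)

es.indices.refresh(index="bibliographic-metadata")

# A query_string query with no explicit field list searches all indexed fields.
result = es.search(
    index="bibliographic-metadata",
    body={"query": {"query_string": {"query": "reptiles OR desert"}}},
)
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```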

c/ Jorum Paradata was a mid-size project with Mimas (a national UK data centre, with focus on education) focussed on Open Educational Resources. Cottage Labs developed a two-tier web application which included a dashboard visualising the usage data of resources across time and by country. The author worked on thoroughly deduplicating and enhancing the app's main dataset of 15,000 records, including merging, canonicalisation, validation and enriching the dataset by adding information from external data sources.

http://beta.jorum.ac.uk/

d/ IDFind: The author finished the development and took on maintenance of this web application. It crowdsources identifier information - e.g. "what does a Digital Object Identifier or an ISBN look like?". Users can then give it a string that they believe is an identifier, but are unsure of what it is. The app is aimed at the text mining / PDF processing / page scraping community within the Open Knowledge field, as well as the occasional scholar.

http://test.cottagelabs.com/idfind/

IDFind might be rather useful for identifying bits of news article text which seem to be identifiers of scholarly work (e.g. DOIs or PubMed IDs): the automatic term extraction will be able to tell that they are not words, or that they look like identifiers, but not what kind of identifiers they are - that is what IDFind is designed to do.
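
As a rough illustration of this kind of detection, the sketch below uses simple regular expressions for DOIs and PubMed IDs. IDFind itself is crowdsourced and far more general; the patterns here are only approximations for the example.

```python
# Sketch: naive detection of scholarly identifiers in a piece of news text.
# The regular expressions are rough approximations, not IDFind's actual rules.
import re

# DOIs start with "10." followed by a registrant code and a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+\b", re.IGNORECASE)
# PubMed IDs are plain integers, so only match them when labelled as such.
PMID_PATTERN = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)


def find_identifiers(text):
    """Return any strings in the text that look like DOIs or PubMed IDs."""
    return {
        "doi": DOI_PATTERN.findall(text),
        "pmid": PMID_PATTERN.findall(text),
    }


example = "The findings (doi:10.1000/example123, PMID: 12345678) were reported today."
print(find_identifiers(example))
```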

All our projects have an associated RESTful API which allows them to function as modular services whose data can be gathered automatically and reused. We design the API right into the architecture of all applicable web projects from the very start.
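
As an example of what such an API could look like for seriousnews itself, here is a minimal sketch of an analysis endpoint that the browser widget could POST article text to. The use of Flask, the route name and the JSON field names are illustrative assumptions rather than a committed design.

```python
# Sketch: a hypothetical seriousnews REST endpoint. Flask, the route
# and the JSON field names are illustrative choices, not a final design.
from flask import Flask, jsonify, request

app = Flask(__name__)


def investigate(article_text, highlighted=None):
    """Stand-in for the lookup function sketched earlier in this application."""
    return []


@app.route("/analyse", methods=["POST"])
def analyse():
    payload = request.get_json(force=True)
    results = investigate(payload.get("text", ""), highlighted=payload.get("highlighted"))
    return jsonify({"results": results})


if __name__ == "__main__":
    app.run(debug=True)
```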

Two projects we have significantly contributed to may be of particular help in this case:

e/ Bibserver, an OKFN project ( https://github.com/okfn/bibserver ), is a tool which manages collections of data. It was initially intended to manage bibliographic data, but has proven (in the OpenArticleGauge project) to be extremely useful as an archive layer for applications which need to deal with a lot of data and still perform well.

This is a highly relevant project as it is designed to store and ingest bibliographic data; the British National Bibliography and other possible data sources could therefore be stored using this piece of software.

f/ Facetview, another OKFN project ( https://github.com/okfn/facetview ), is a web user interface component which allows the intuitive exploration of data via an automated data analysis technique called "faceting" (analysing what values a particular field holds across a large dataset). The real-time search responses combined with the ability to filter on facets make drilling down on interesting data much easier. A good demonstration on a large dataset can be seen here: http://beta.jorum.ac.uk/find

We, as a team, would be particularly interested in seeing the British National Bibliography through facetview, and the author would probably use this as a simple interface for the seriousnews service. This will allow faster exploration and give an idea of what kind of data is available in the National Bibliography in the first place. It will then be possible to tell in much more concrete detail how news articles and scholarly outputs can be linked together.
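
Under the hood, this kind of faceting amounts to an aggregation query against the search index. A rough sketch follows, reusing the invented index from the earlier indexing example (the field name is equally illustrative).

```python
# Sketch: a single "facet" computed over the whole index, i.e. counting
# which values one field holds and how often. The index and field names
# are the invented ones from the earlier indexing sketch.
from elasticsearch import Elasticsearch

es = Elasticsearch()

response = es.search(
    index="bibliographic-metadata",
    body={
        "size": 0,  # only the counts are needed, not the matching records
        "aggs": {"journals": {"terms": {"field": "journal.keyword", "size": 20}}},
    },
)

for bucket in response["aggregations"]["journals"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```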

2/ Another potentially difficult part of this project will be retrieving the text of the news article that the user is looking at.

We have done some preliminary investigation into retrieving the text of news articles in machine-friendly form automatically, so that we could then process them in the seriousnews service for term extraction. It turned out that the BBC provides a little bit of information in machine-friendly form, but that does not include the content of news articles, or even the metadata for a particular news article - they give access to a list of their news “topics” (like Technology, Culture, etc.) and to data describing the latest news within these topics, but not to particular news articles.

The Guardian Open Platform does give access to individual news articles. The seriousnews browser widget will be able to use this to its great advantage when the reader is perusing a Guardian news article on the web.
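
As an illustration, fetching the body of a single article through the Content API could look roughly like the sketch below. An API key is required, and the endpoint path, the show-fields parameter and the response structure shown here are assumptions to be checked against the Open Platform documentation.

```python
# Sketch: fetching the body text of one Guardian article through the
# Open Platform Content API. The parameter names and response structure
# are assumptions to verify against the official documentation.
import requests

API_KEY = "YOUR-API-KEY"  # obtained from the Guardian Open Platform


def guardian_article_text(article_path):
    """article_path is the path part of the article URL,
    e.g. "science/2013/jun/24/some-article-slug" (hypothetical)."""
    response = requests.get(
        "https://content.guardianapis.com/" + article_path,
        params={"api-key": API_KEY, "show-fields": "body"},
    )
    response.raise_for_status()
    content = response.json()["response"]["content"]
    return content["fields"]["body"]  # HTML body of the article
```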

It seems that a lot of electronic media outlets do not give machine-friendly information about their news articles, much less the text of the articles themselves, even though their main purpose is the widest possible distribution of this information. We have encountered similar issues in the past with information relating to academic papers (not the content, just publication metadata like license and reuse rights).

We had to use non-machine-friendly information - the content aimed at human readers - and work out the information we were interested in by “scraping” that human-targeted content. We partnered with the Public Library of Science to produce a tool which would determine the license of a scholarly article given nothing but the article’s identifier. The tool, OpenArticleGauge or IsItOpenAccess, is accessible at http://oag.cottagelabs.com/

The author intends to use the same approach to support accessing the text of a news article the reader is interested in “investigating”. For media outlets which do not yet provide machine-friendly access, scraping the content of the news article will be employed, building on the lessons learned from the OpenArticleGauge project.
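
A rough sketch of such a scraping fallback is given below; the CSS selectors are placeholders, since the markup of each outlet differs and tends to change, which is exactly why the OpenArticleGauge lessons matter here.

```python
# Sketch: scraping the human-readable page of a news article as a
# fallback when no machine-friendly access exists. The selectors are
# placeholders; real ones must be worked out per outlet and maintained.
import requests
from bs4 import BeautifulSoup


def scrape_article_text(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selector: many outlets wrap the story body in a
    # container div, but the class name differs from site to site.
    container = soup.find("div", class_="story-body") or soup
    paragraphs = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    return "\n".join(paragraphs)
```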

3/ The last difficult aspect is term extraction and news text analysis.

The author has done basic term extraction when (successfully) trying to fix differences in organisation names in the Jorum Paradata project (see 1.c/ within this section). In essence, sometimes the metadata referred to “University of Nottingham”, at other times it was “Nottingham University”. The problem was dealt with by extracting a list of all possible organisations and creating a canonical list of names. This could have been automated but in this case took manual work to finish. The approach is thus not applicable to this project as-is, but the experience of building up a word list and then turning it back on the source data certainly is.

A partner in the Cottage Labs partnership, Mr. Mark MacGillivray is currently working on a component which would allow term extraction for scientific text. The software work is currently unpublished as it is part of a much larger doctoral dissertation - however, Mr. MacGillivray’s current expertise will prove valuable to this aspect of the seriousnews project.

This is doubtless a weak point of the seriousnews idea and will be addressed by allowing much more time for term extraction in the work plan than would be usual for a clear-cut piece of development work. Of course, the seriousnews service is still valuable without the term extraction, as the user can still highlight the quote or bit of text they’re interested in while on the news article web page, and seriousnews can still search the British National Bibliography as well as any other data sources for it.
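
Even so, a deliberately naive baseline (most frequent non-stopword tokens) would allow the end-to-end flow to be exercised while the proper component is developed. A minimal sketch follows; the stopword list is a placeholder and far too short for real use.

```python
# Sketch: a deliberately naive term extraction baseline, counting the
# most frequent non-stopword tokens in the article text. The stopword
# list is a placeholder and far too short for real use.
import re
from collections import Counter

STOPWORDS = {
    "the", "a", "an", "and", "of", "in", "to", "that", "is", "was", "it",
    "on", "for", "with", "as", "by", "at", "this", "have", "has", "said",
}


def naive_terms(article_text, how_many=10):
    tokens = re.findall(r"[a-z][a-z-]{2,}", article_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(how_many)]
```
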
Please provide evidence of how you think your idea is achievable on a technical, curatorial and legal basis*

Indicate the technical, curatorial and legal aspects of the idea (you may want to check with Labs team before submitting your idea first).

Technical

The three main technical problem areas of this project (searching lots of scholarly metadata, getting news text, term extraction) are described in the previous experience section above, including ways in which the author or Cottage Labs LLP have dealt with these problems before.

Curatorial

No new information will need to be collected or curated by the British Library Labs team(s) as part of the project.

Access to the British National Bibliography will be needed, but this is readily available through a Z39.50 access point. The author’s preliminary investigation shows that there seem to be readily available open source components in the author’s preferred development environment which can be used to communicate using this protocol.
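
One such component is PyZ3950, a Python implementation of the protocol. The sketch below is indicative only: the host, port, database name and query used for reaching the British National Bibliography are placeholders to be confirmed, not the service’s actual connection details.

```python
# Sketch: querying a Z39.50 access point with PyZ3950. The connection
# details (host, port, database name) are placeholders, not the actual
# details of the British National Bibliography service.
from PyZ3950 import zoom

conn = zoom.Connection("z3950.example.bl.uk", 210)  # placeholder host/port
conn.databaseName = "BNB"                           # placeholder database name
conn.preferredRecordSyntax = "USMARC"

query = zoom.Query("CCL", 'ti="ancient desert reptile"')
results = conn.search(query)
for record in results:
    print(str(record))
conn.close()
```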

Legal

The British National Bibliography is available for any non-commercial purpose. The conditions of other bibliographic data sources will be considered carefully before adding support for these sources or adding data from them to the seriousnews service. Since seriousnews is primarily a personal research support tool, it is expected that there will not be great legal difficulties in obtaining permission (or already having it by law) to use such bibliographic resources.

Legal status of news articles - this is quite a grey area, as the copyright of the articles of course belongs to the media organisations that publish them. Some of them, like the Guardian, have chosen to license the data openly - thus, their text can be used for personal research purposes without any doubt.

It is also true that use for personal research purposes is an example of fair dealing (often called “fair use”) in copyright law. As long as seriousnews does not store the text of the news articles but rather only uses it (in the browser widget, or on the service side) for term extraction, there might not be a legal problem or a clash with a media outlet’s Terms and Conditions at all, regardless of what these say.

It is important to remember that such innovative uses of data are best demonstrated first - if the copyright holders are unimpressed by the added value, then support for their sites can easily be removed from seriousnews. This is unlikely, however - it would mean that other media outlets gain an advantage just by letting seriousnews analyse their news articles (i.e. it would be possible to investigate the science behind Guardian articles more easily than the articles from a hypothetical media outlet which has refused seriousnews access to its news text).

The author would strongly prefer that all software outputs are open source under a permissive license such as the MIT license, the new BSD license or the Apache 2.0 license (or other), depending on the legal requirements of the British Library itself. The whole scholarly sector and the Open Knowledge sector in particular have greatly benefited from the permissive sharing and reuse that such licenses allow. The author is ready to present more information in support of this point if required.

Please provide a brief plan of how you will implement your project idea by working with the Labs team

You will be given the opportunity to work on your winning project idea between July 6th - October 31st 2013

NOTES: All tasks to be performed by the author unless otherwise stated. Expertise may be brought in from Cottage Labs LLP if required. If this is done, it will be done in a transparent and clear way, such that the British Library Labs team(s) do not suffer from increased communication overhead on this particular project. The British Library already provides a stable interface to all required data - yet it is noted here that most of the tasks below do rely on the National Bibliography service being available, at least at the start of the project (and later on for updates to the National Bibliography).

July 6 - Onwards
1/ Desk-based research and analysis of the available metadata for the British National Bibliography, possibly conversing with the British Library team behind that project
2/ Procuring a sufficiently powerful computer system with enough operating memory to hold an index of this metadata
3/ Indexing of the available metadata into elasticsearch
4/ Investigating ways to gather news article content in more depth, focussing on the readily available machine-friendly Guardian API, as well as at least one case of scraping the text (perhaps focussing on the structure of the BBC Science and Technology section)

August 2013
5/ Investigating term extraction and analysis
6/ Getting the basic seriousnews service running
7/ Prototype browser widget (any browser - JavaScript cross-browser bookmarklet or application, see Firebug Lite for an example)
8/ Getting the text of a news article and sending it to the seriousnews service, as well as displaying the results on screen

September 2013

9/ More stable term extraction and analysis, enhanced with identifier detection if there is time (to detect DOIs etc. in the text of articles)
10/ Improving the results display and options of what to do with the results of the seriousnews browser widget (e.g. send to email address)
11/ User testing - the service and widget together are meant to be useful in their own right by this point
12/ User testing with developers - the service itself is meant to be useful by this point as a place for doing term extraction and analysis of news text
13/ User profiles in the seriousnews service (if user testing finds them a useful addition), so that searches or “investigations” can be persisted on a per-user basis. The profiles will include an option for the user to state whether they agree to their usage data being used in aggregated form or for statistical research purposes.

October 2013
14/ Based on the introduced user profiles - a small, simple, self-updating visualisation of the usage statistics of the seriousnews service (what people are using the service for and which data they are interested in, if the data gathered with consent by the user profile component allows such correlations to be made)
15/ A simple, self-updating topical visualisation, showing what kind of words are most popular in news articles that the service has analysed and what kind of scholarly outputs it has managed to find (the idea being that, if the results of the service are of sufficient quality, this will also show what kind of research the media is likely to report on)
16/ Communication and marketing efforts - this is a service which can be used by researchers, as well as those who are particularly interested in analysing news text, but seriousnews is also useful to the general public. As such, at least moderate effort should be put into marketing the service to users who would be interested in seeing the science behind the claims made by the media. One way to do this would be to partner with media outlets who wish to be able to assert that their articles which refer to scholarship are trustworthy.
17/ Further work on term extraction and analysis as this is what will determine the perceived “quality” of the results the service produces
18/ Look into additional data sources for the service beyond the National Bibliography. Such data is (in the author’s experience) not usually readily shared by publishers and can take a project of its own to gather, but recommendations could be made for further development of seriousnews (if the British Library would like to take it further or host it after the initial 4-month period is completed)