"Indexing the BL 1 million, and Mapping the Maps" (Category: Research)

Name of Submitter(s): James Heald

Organisation: Private individual (Wikimedia volunteer)

The 1 million out-of-copyright images released by the British Library in November 2013 are a fantastic resource, but one in which particular content is hard to search for, and hard to organise for discovery: metadata is limited to author, title, publisher and date at the book level, with nothing at the level of the individual image. As a result when offered a bulk upload of the full collection, Wikimedia Commons felt this was an offer it could not accept, because without good metadata about the subject of the image at the image level, the images could not be made categorisable and so would simply not be discoverable.
My focus has therefore been to try to improve the availability of the resource, and to make different parts of the content more discoverable and better grouped, with the ultimate aim of seeing more of the content made available via the deeply structured schemes at Wikimedia Commons.
As a first step, an index based on the subjects of the books from which the images were drawn was built up on-wiki, initially by hand, and then using shelfmarks with some ongoing further hand refinement. Although still rather rough-and-ready, this has acted as a useful guide for individuals to identify content of interest to them, which has so far led to over 20,000 images being uploaded to Wikimedia Commons on a book-by-book basis.
However, the manual upload and description of the images that is involved, and especially the image-by-image categorisation into appropriate Wikimedia Commons categories, remains very time-consuming.
But one class of images that it should be possible to upload and categorise reasonably straightforwardly is maps and plans -- of which there are a large number in the 1,000,000 images, of locations from all over the world, because a particular focus in the choice of books scanned was books on discovery, ethnography, travel, and local history worldwide.
In Spring 2014, the curator of digital mapping in the BL Maps department, Kimberly Kowal, had already successfully run a project to crowdsource the georeferencing of 3,000 images from the 1 million that taggers on Flickr had already identified as maps, using the platform the BL had developed with its longstanding georeferencing partner Klokan Technologies, and she was keen to find more to georeference, in part to keep satisfied the enthusiastic community of volunteer georeferencers she had built up, who were hungry for more. This was also an attractive set of content to think about uploading to Wikimedia Commons, of very real interest to illustrate for example the historical development of particular towns, or records of journeys of exploration, or ground-plans of buildings still extant today, or contemporary plans of battlefields, as well as for their original historical context -- whilst the geographical position information from the georeferencing process should at the very least make it possible to make a reasonable start at placing the images into the Wikimedia Commons category structure. So more georeferencing seemed very attractive; the one problem was that nobody knew where (or even how many) maps were there, hidden like needles in the haystack of the 1 million images.
However, the wiki-based index of the books, plus some scripts written to keep a progress-tracking page updated, made it possible to design a process for a geographically diverse online group of volunteers to go systematically through books of interest to them, identifying and tagging on Flickr any maps in the book images, removing a to-do marker template from the index for each book as it was done.
In this way, starting with a day-long face-to-face event in London organised by the British Library, and continuing online, a group of volunteers in November and December 2014 reviewed all the images, identifying and tagging on Flickr almost 30,000 previously uncatalogued maps and plans. In addition the project caught the attention of the computational artist Mario Klingemann (@Quasimondo), who single-handedly identified a further 20,000 map images using machine-supported pattern recognition methods. The 50,000 maps identified and tagged in all represent about five times the number originally suspected to be in the books.
The next stage of the project is now underway, that of georeferencing the discovered maps and plans using the BL/Klokan platform, by identifying geographical points in common that can be found on both the digitised image and a current map. With enough points, this gives the satisfaction of being able to view the old map laid perfectly over the new map, so that one can fade up and down between the two and see exactly how they compare; and it also reveals the precise location and scale information needed to characterise in detail what the map is of.
This stage of the process began at the end of March, and as of mid-September about 8,000 images of maps have been georeferenced, in addition to the original 3,000 in 2014. The process is again being tracked on the wiki-index, which has been replicated with tagging on Flickr, so that volunteers can, if they wish, choose maps to work on from the books on a particular index page for a particular part of the world. Scripts automatically track the progress as maps are georeferenced, keep a main progress page updated, and remove the to-do markers against individual books, so that the range of books remaining to do is immediately apparent. This is analogous to the previous stage, where the markers were removed by hand, but here the tracking is automatic.
Simultaneously, the map scale and central co-ordinates (plus analogues offset towards the four corners) are fed to OpenStreetMap's Nominatim service, to try to identify ("geo-parse") what the map may represent -- which continent, country, district, city or place. The results are added to an online/downloadable Google Fusion Table spreadsheet, which records the results for all the georeferenced images and is kept updated in real time. They are also added as Flickr tags to the British Library's page for the image on Flickr, allowing direct retrieval on Flickr of all map-images from a particular continent or country or district or city, and of all those corresponding to a feature of a particular size -- eg all cathedral-sized buildings, or city-sized plans.
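A minimal sketch of the sampling step described above, written in Python for illustration (it is not the project's own script). It assumes the georeferencing result is available as a west/south/east/north bounding box in degrees, and that the four extra points lie a fixed fraction of the way from the centre towards each corner; the function names and the `fraction` parameter are hypothetical. The request parameters (`format`, `lat`, `lon`, `zoom`, `addressdetails`) are those of Nominatim's public reverse-geocoding endpoint.

```python
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def sample_points(west, south, east, north, fraction=0.5):
    """Return the centre of the box plus four points offset towards its corners.

    `fraction` (an assumed parameter) is how far along the way from the centre
    to each corner the offset points sit. Boxes crossing the antimeridian are
    not handled in this sketch.
    """
    clat = (south + north) / 2
    clon = (west + east) / 2
    dlat = (north - south) / 2 * fraction
    dlon = (east - west) / 2 * fraction
    return [
        (clat, clon),                # centre
        (clat + dlat, clon - dlon),  # towards the north-west corner
        (clat + dlat, clon + dlon),  # towards the north-east corner
        (clat - dlat, clon - dlon),  # towards the south-west corner
        (clat - dlat, clon + dlon),  # towards the south-east corner
    ]


def reverse_url(lat, lon, zoom=18):
    """Build a Nominatim reverse-geocoding URL for one point.

    `zoom` (0-18) sets the level of detail Nominatim answers at, so a
    small-scale map can be queried at a coarser level than a building plan.
    """
    params = {
        "format": "jsonv2",
        "lat": f"{lat:.6f}",
        "lon": f"{lon:.6f}",
        "zoom": zoom,
        "addressdetails": 1,
    }
    return NOMINATIM_REVERSE + "?" + urlencode(params)
```

Each of the five URLs would then be fetched in turn, at a polite request rate, and the returned `address` fields compared.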
Currently under development is work to augment the results from Nominatim, by developing a library of bounding-boxes for frequently-occurring larger-scale features, and by co-ordinate searching on Wikidata for smaller-scale features, to characterise more precisely what the images are maps of (since, for example, the Nominatim method at the moment seems to prefer to identify co-ordinate targets as streets rather than cathedrals, whereas Wikidata currently doesn't have so many streets in it). If methods can be developed to identify, for these locations, the most appropriate corresponding destinations (plural) in Wikimedia's endlessly baroque and unpredictable category structure, then together with an uploading script my hope is that (with a bit more work) this should open the way to bulk upload of all 11,000 images so far georeferenced, both to Wikimedia's new platform for georeferenced maps and to Wikimedia Commons with a reasonably good initial categorisation, with ongoing near real-time upload of further georeferenced images. That in turn I hope will be just the incentive to encourage the Wikimedia volunteer community to become involved in numbers, and start to make serious inroads into the 42,000 images, identified as maps as a result of the work last year, that still remain to be georeferenced.
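The bounding-box idea could be sketched roughly as follows, in Python for illustration. This is only one possible design, not the method under development: it assumes each library entry is a named west/south/east/north box in degrees, and that when several boxes contain a map's centre the smallest one is the most informative. The `Box` type and `smallest_containing` function are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """A named bounding box in degrees (hypothetical library entry)."""
    name: str
    west: float
    south: float
    east: float
    north: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east

    def area(self) -> float:
        # Area in square degrees: crude, but enough to rank nested boxes.
        return (self.east - self.west) * (self.north - self.south)


def smallest_containing(boxes: Sequence[Box], lat: float, lon: float) -> Optional[str]:
    """Return the name of the smallest box containing the point, or None."""
    hits = [b for b in boxes if b.contains(lat, lon)]
    return min(hits, key=Box.area).name if hits else None
```

With two nested boxes (the extents here are rough, for illustration only), `smallest_containing([Box("Great Britain", -8.0, 49.8, 2.0, 61.0), Box("Isle of Wight", -1.6, 50.57, -1.05, 50.78)], 50.70, -1.29)` picks the island rather than the whole of Great Britain.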

URL for Entry: https://commons.wikimedia.org/wiki/Commons:British_Library/Mechanical_Curator_collection/georeferencing_campaign
Email: j.heald@ucl.ac.uk

Twitter: @heald_j

Job Title: volunteer individual

Background of Submitter:

I'm simply a volunteer, with some weekend-level Perl scripting skills.
But the wiki platform, and the set-up of the BL images on Flickr, makes it possible to do quite a lot with that.
The real unsung stars are the (really quite few) volunteers who have made mammoth contributions in the crowdsourcing efforts -- the couple of volunteers who took care of the whole of France and Germany respectively in the index building, the German volunteer single-handedly responsible for the lion's share of images uploaded to Wikimedia, the eight or so map taggers who each tagged more than 1000 images; Mario Klingemann for another 20,000; the BL's handful of really dedicated hard-core georeferencers. But thanks are also due of course to everybody who made contributions of any size -- it all adds up.
Big thanks too to Kimberly Kowal, for initiating the BL's georeferencing programme in the first place, overseeing every aspect of it, and managing the key relationship with Klokan -- and of course to Mahendra Mahey and BL Labs, and in particular Ben O'Steen for creating the BL 1 million image collection in the first place, releasing the crucial underlying data, and all the bots to keep it going, track everything, and handle all the housekeeping -- wrangling some sometimes quite temperamental Flickr services.

Problem / Challenge Space:

The key aim was to make the BL 1 million collection more accessible, to encourage its upload and further diffusion through platforms including Wikimedia Commons.
Part of the challenge has been to design a crowdsourcing process that enables individuals to contribute in such a way that their contributions build together towards the overall goal, and that they can see the progress being made through their contributions.

Approach / Methodology:

The technology is actually reasonably straightforward. For the tagging stage, the tagging is all handled by Flickr, and the marker templates on the wiki pages were removed by people manually; all I had to do was to write a script to poll those pages every 15 minutes and write the results to a progress page. The main thing really was to try to come up with a clear process, clear instructions, and then make sure everything kept running.
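A rough sketch of the counting at the heart of such a polling script, in Python for illustration (not the original script). It assumes the wikitext of an index page has already been fetched, and that each book still to be reviewed carries a marker template; the template name `Maps to do` and both function names are hypothetical.

```python
import re

# Hypothetical template name: the real to-do marker used on the index pages
# is not specified in the text above.
TODO_MARKER = re.compile(r"\{\{\s*Maps to do\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE)


def count_todo_markers(wikitext: str) -> int:
    """Count the to-do marker templates still present in a page's wikitext."""
    return len(TODO_MARKER.findall(wikitext))


def progress_row(page_title: str, wikitext: str, total_books: int) -> str:
    """Format one wiki-table row: page, books done, books in total, percent."""
    remaining = count_todo_markers(wikitext)
    done = total_books - remaining
    percent = 100.0 * done / total_books if total_books else 100.0
    return f"| {page_title} || {done} || {total_books} || {percent:.1f}%"
```

A polling loop would fetch each index page's wikitext (for example through the MediaWiki API), call `progress_row` for each page, write the assembled table back to the progress page, and then sleep for 15 minutes.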
The scripting for the georeferencing stage has been slightly more involved; but again the heavy work is done by the Klokan georeferencer platform. I just need to monitor its results and update the progress pages and the index pages. The most involved thing has been to take the coordinates, try to turn them into geographical features, and then send the corresponding tags to Flickr. This ended up running to quite a lot of code: sending the coordinates (and the 4 corner-wise perturbations of them) to Nominatim, then interpreting the results, putting the 5 returns through a voting system, and rolling all the code together into a single script that can keep running automatically (and that also updates the fusion table spreadsheet). Putting all that together took quite a bit of time -- I'm only an amateur weekend coder -- but it's nice once it's done, and it all just keeps going by itself.
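The voting step might look something like the following sketch, in Python for illustration. The actual voting rules are not described above, so this is an assumed scheme: each of the 5 returns is taken to be the `address` dictionary from a Nominatim reverse lookup, a name is accepted at a given level only if at least `min_votes` of the returns agree on it, and voting stops at the first level where they disagree. The list of levels, the threshold and the function name are all assumptions.

```python
from collections import Counter

# Address keys as returned by Nominatim, ordered from coarsest to finest.
# Which levels to vote on is an assumption of this sketch.
LEVELS = ("country", "state", "county", "city")


def vote(addresses, min_votes=3):
    """Return the agreed place name at each level, coarsest first.

    A level absent from every return is skipped; the first level where no
    name reaches `min_votes` ends the voting, since finer levels cannot
    then describe the whole map.
    """
    agreed = {}
    for level in LEVELS:
        counts = Counter(a[level] for a in addresses if a.get(level))
        if not counts:
            continue
        name, votes = counts.most_common(1)[0]
        if votes < min_votes:
            break
        agreed[level] = name
    return agreed
```

Under this scheme a map whose five sample points fall in two different counties of the same country would be tagged down to the county only if one county wins at least three of the five votes.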

Extent of showcasing BL Digital Content:

The whole project has been based on trying to showcase the BL 1 million content, to get people to become involved with it and to explore it, to make it more accessible, to diffuse it more widely, and ultimately to try to get it into very visible use, on Wikipedia article pages.

Impact of Project:

  • A detailed subject index now exists to the books scanned for the BL 1 million collection -- yes, a little rough-and-ready, but definitely usable.
  • 50,000 images have been identified as maps or plans, and appropriately made discoverable by tagging them on the BL Flickr pages.
  • 11,000 map images have been georeferenced, and searchably tagged on Flickr to indicate the geographical areas (at all scales) they are contained in, and the scale (extent) of the feature.
  • Silver award to the British Library and Wikimedia UK in the Open category at the 2015 Muse Awards of the American Alliance of Museums
Jurors said: “Love the fact that they backed this with stats and outcomes. Also impressive how many they managed to find and tag. The geo-referencing looks great as well.” And “This seems to be a very focused approach and the people involved in this project have shown great level of professional expertise.”
  • Well-received presentation at Europeana Tech conference in February in Paris
(includes time-lapse of the progress of the tagging process, slides 10-25)
  • Updated version presented at GLAM-Wiki 2015 in The Hague
(includes a little about the challenge of mapping located features to Commons categories, slides 40 to 48).

Issues / Challenges faced during project(s):

See above for some of the coding and design.
The main challenge remaining is as set out in the main text: better refinement of the features identified, an upload script to Commons, matching the features to Commons' truly baroque category system, and coming up with a good set of workflow processes and maintenance categories for people to refine the categorisations manually (though I hope I can get them to be pretty good to start with). That is all going to need some work.
Other than that my main regret is that I have had less time than I had thought over the summer to get on with that, so I haven't yet got things as far ahead as I hoped I would have done -- I had hoped the above would all be done by now. It's a concern, because it's really (IMO) getting the real-time upload to Commons working that will be the big carrot that I hope will get wiki-volunteers to really get into the georeferencing. Once the system is there, then with luck I can really start to publicise it to the volunteers. The georeferencing is inevitably a bigger job and slower than the tagging; but I hope that that will give the momentum to get the process really going -- as it did last year.