A page for blue-sky thinking.

"Page" numbers


"Page" numbers are in fact not page numbers from the book, but rather frame numbers for frames in the PDF.

This is unfortunate, because it means the numbers don't correspond to the real page numbers given on, eg, "list of illustrations" pages; to page numbers in printed copies of the text that people may have access to; to page numbers in different scans of the text (either other people's scans, or scans of duplicate editions that are in the Collection); or to page numbers in an OCR'd text copy.

The Internet Archive's books project uses a good scheme, where numbers are used for the main sequence of page numbers, and other pages -- eg introduction pages, appendix pages, duplicated page numbers, or unidentifiable page numbers -- are given numbers in sequence with an "n" prefix, eg n15 vs 100. Would it make sense to adopt a scheme like this?
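
A rough sketch of how labels of this kind might be assigned, for discussion only. It assumes we somehow had, for each PDF frame, the page number printed on it (or nothing, if none can be identified); that input is hypothetical, as is the choice to use the frame index as the "n" number.

```python
from typing import List, Optional, Sequence


def label_frames(printed: Sequence[Optional[int]]) -> List[str]:
    """Return one label per PDF frame.

    printed[i] is the page number printed on frame i, or None if the
    frame has no identifiable number (hypothetical input).
    """
    seen = set()
    labels = []
    for frame_index, page in enumerate(printed):
        if page is not None and page not in seen:
            # First occurrence of a real page number: use it as-is.
            seen.add(page)
            labels.append(str(page))
        else:
            # Unnumbered or duplicated page: fall back to an n-prefixed
            # frame index (an assumption made for this sketch).
            labels.append(f"n{frame_index}")
    return labels
```

For example, `label_frames([None, None, 1, 2, 2, None, 3])` labels the two front-matter frames and the duplicated page 2 with an "n" prefix and leaves the rest as plain page numbers.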

In the meantime, as a matter of urgency, can we change all instances of the existing "Page" tag into a "Frame" tag, before too many people start citing pages that aren't pages?

Unfortunately, the word "Page" is also used in all of the descriptions, and this also really ought to be changed.


[from benosteen]
Agreed. It's a little tricky to supply page identifiers (number, numeral, etc) that correspond to the book's logical ordering as that information is not present in the OCR data for it.

"Page" was used because it was the term commonly used here. We could shift to a more semantically correct tag name when I next do a pass updating descriptions for the images.


Ordering of the images


At the moment, images appear in apparently random order in the photostream, and in any searches made on it (eg: searches for all the particular images in a given title). This adds to the general bran-tub lucky dip atmosphere of jollity (a good thing), and helped e.g. Public Domain Review rapidly survey a random sample of 5,000 images to produce their gallery here

But on the other hand, it breaks up groups of images, so images from a particular chapter of a book, perhaps all of a particular place, don't get presented together; it loses any feel for the flow of images through a book; and, for example, the left-hand page of a map will usually be separated from the right-hand page. It also makes it difficult to compare the images to the "List of Illustrations" page of the book (for example Turner and Girtin's Picturesque Views list of plates vs the images). Then again, it may be a good thing that 'small' images -- very often ornamental capital letters -- are currently often placed at the end.

So would it be a good thing to re-order the collection? According to the Flickr docs, one can do this by post-facto altering the upload dates. It should be possible to do this programmatically, through the Flickr API. If each image was given one second for its artificially-adjusted upload date, it could be made to appear that the collection was uploaded over 12 days, so nothing grossly deceptive. Making it appear that each book had been uploaded from the last page first would make the plates appear in page-number order.
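
To make the idea concrete, here is a minimal sketch of how the artificial dates could be computed. The function name and inputs are made up for illustration. Actually applying the dates would presumably go through the Flickr API's flickr.photos.setDates method and its date_posted argument, which is not shown here.

```python
from datetime import datetime, timedelta, timezone


def posted_dates(photo_ids_in_page_order, start):
    """Map each photo id to an artificial 'date posted', one second apart.

    The last image in page order gets the earliest time, so the first
    page ends up newest and is shown first in a newest-first photostream.
    """
    n = len(photo_ids_in_page_order)
    return {
        photo_id: start + timedelta(seconds=n - 1 - i)
        for i, photo_id in enumerate(photo_ids_in_page_order)
    }


# Sanity check on the "12 days" figure, assuming roughly one million images
# at one second each: a little under 12 days.
days_needed = timedelta(seconds=1_000_000).total_seconds() / 86_400
```

Running each volume through this in turn, with `start` advanced by the number of images already handled, would keep the whole collection inside the same window.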

Is this a runner? Do people even think it would be a good idea?


[from benosteen]
Flickr's display and UI are unfortunately problematic and a lot of compromises are being made. The ordering is somewhat random and new images are surfaced from the collection to sit on the front page as those still get the lion's share of hits.

Flickr's page arrangement and ordering tends to make the images fit in place rather than preserve the explicit ordering of them. One possibility is to nail down an order for each volume by creating a set for each one (~65k) but it would remain to be seen how Flickr copes with that. We are already stretching its ability to handle simpler things.

Also I've noticed that for a number of tags (eg 'rotate' which is important next) the search results are seemingly not comprehensive. I have a script that harvests the various 'rotate-me' tags that have emerged from the community, but not all of the images that are tagged in this way are being returned in the results.

Automatic rotation


How much scope is there for automatically trying to guess whether images should be rotated? At the moment, a great many images are on their sides, but a fully manual approach would be prohibitively slow. Tags have been suggested for crowd-tagging of images that need rotation; but this may take some time.

So how much scope is there for automatic identification -- eg perhaps from the orientation of any identifiable text, or identifying areas of ground and sky, or the orientation of faces? There might be a slew of false positives, but so many orientations are wrong now, that the gains would probably hugely outweigh the losses. Images auto-identified for rotation could perhaps be identified with a machine tag, 'rotate_c_candidate : cohort = xxxxx' to allow rapid manual review, before the rotation was carried out.

One issue is that rotation breaks the Flickr URL, so it would be good to do anything that we can soon, before people start citing a Flickr URL that we then break.

Identification of caption text


Have any attempts been made to identify caption text for images using OCR -- in particular, image titles?

Also, is there a tag proposed for image titles?

Use of text from "List of illustrations" pages


It may similarly be possible to extract titles from "List of illustrations" or "List of plates" pages.

It would be nice to try to create a dataset of which page(s) these are for each of the 50,000-odd volumes in the collection. This is certainly a thing we would love to have for eg the Full list of titles and Title-level topic and place index pages on Wikimedia Commons.

Problems with the BL Itemviewer?


The BL itemviewer seems to repeatedly give "The connection has timed out" errors.

Is there an issue with the software, or the linking, or is it being overwhelmed?

Also, is it possible to come up with URLs that are easier to understand and to adapt?

The Internet Archive's
https://archive.org/stream/turnergirtinspic00mill#page/112/mode/2up
is a lot easier for a human to understand and interpret (and modify) than something like the Itemviewer's
http://itemviewer.bl.uk/?itemid=lsidyv3c0d5ce4#ark:/81055/vdc_000000056249.0x00011D

Also, is there any plan to bulk-upload the texts to the Internet Archive, where they would be available on the same platform as perhaps comparable texts from other libraries?

Repeatedly occurring images


Once a woodcut or a steel engraving had been created, it was not uncommon for it to be re-used again and again, turning up like the proverbial counterfeit penny. (See the great comment on this image for one interesting "catch").

With decorative space-fillers and ornate set-off capital letters, which the collection is full of, the reiteration is even more extreme.

There could be great value in developing some code with a capability similar to Tineye or Google reverse image search, and then running it on the whole million images.

Such a thing may not be too hard -- at first glance, there are essentially three things that are needed: firstly, an algorithm to develop a multi-dimensional 'fingerprint' for each image, which is as invariant as possible to re-sizing or rotation; secondly, a way to identify near neighbours in such a space, perhaps using a k-d tree; and finally, a direct pair comparison algorithm, to make the final call as to whether any particular pair of images should be considered sufficiently similar or not.
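
As a toy illustration of those three pieces, and not a proposal for the real thing, the sketch below uses a simple "average hash" as the fingerprint, a brute-force scan in place of the k-d tree, and a Hamming-distance threshold as the pair comparison. Images are assumed to be already decoded into rows of greyscale values at least 8 pixels on each side. The function names and the threshold of 5 are arbitrary choices made here. Note that this particular fingerprint tolerates re-sizing but not rotation.

```python
from itertools import combinations


def average_hash(pixels, size=8):
    """64-bit 'average hash' of a greyscale image given as a list of rows.

    The image is shrunk to size x size by block averaging; each cell then
    becomes one bit: 1 if brighter than the mean of all cells, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def candidate_pairs(hashes, max_distance=5):
    """Brute-force neighbour search over {image_id: fingerprint}.

    Quadratic in the number of images, so only a stand-in for a proper
    index when run on a small sample.
    """
    return [
        (id_a, id_b)
        for (id_a, h_a), (id_b, h_b) in combinations(sorted(hashes.items()), 2)
        if hamming(h_a, h_b) <= max_distance
    ]
```

On a million images the brute-force step would need replacing with a proper nearest-neighbour index, and the final pair comparison would want to look at the images themselves rather than just the fingerprints.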

Such a thing is probably well documented already, and there may even be open source code; so it would be good to investigate the research in the area.

I could see us adding a "stock_image_ddddddd" tag to the images, which would quickly allow all instances, derivatives and analogues of a particular image to be retrieved. This would allow people to identify eg the best version of a particular plate, where we have several copies in the collection. We could track how a particular plate had been used and re-used from work to work. It would make it easier to find further works on particular topics (because they re-used the same plates). We could perhaps even produce a list of the Top 100 most re-used portraits.

And, of course, it could hugely accelerate our image-level tagging, particularly of decorative elements, if it meant we could identify one and tag all.


Improving Discovery


As you've probably noticed, the Internet Archive has started a similar project and has uploaded over 2 million images from its book scans, and is supposedly uploading 14 million images in total. Digging through their collection I noticed that they added one little extra which makes the entire process of search and discovery so much more convenient than with the BL collection: in the description text they added two blocks: "Text appearing before image" and "Text appearing after image". This makes it possible to use Flickr's text search, and works astonishingly well: https://www.flickr.com/search/?w=126377022@N07&text=steam%20train

So I wonder if there is a way to extract and add this kind of data also to the BL collection without breaking the image ids?
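
For what it's worth, a minimal sketch of how such blocks could be assembled, assuming we had the OCR words of a page in reading order and knew the position between words where the image falls. Both of those inputs are hypothetical, as are the function name and the 30-word window. Updating only the description of an existing photo should, I would expect, leave its id untouched, but that would need checking against the Flickr API.

```python
def context_blocks(page_words, image_index, window=30):
    """Build the two description blocks for one image.

    page_words:  OCR words of the page, in reading order (assumed input).
    image_index: number of words that precede the image (assumed input).
    window:      how many words to keep on each side.
    """
    before = " ".join(page_words[max(0, image_index - window):image_index])
    after = " ".join(page_words[image_index:image_index + window])
    return (
        "Text appearing before image:\n" + before + "\n\n"
        "Text appearing after image:\n" + after
    )
```

The resulting text would be appended to the existing description rather than replacing it.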