User:Fæ/Project list/Internet Archive
This is the batch upload project page for the Internet Archive book plate project. This page includes project background as well as usage reports and improvement recommendations for Category:Files from Internet Archive Book Images Flickr stream.
The batch upload project from the Internet Archive Flickrstream of books is limited to images of at least 2,800 pixels on the longest side, which mostly returns large images that IA has cropped from book plates. These require some post-upload "housekeeping", as scanned blank pages are unavoidable, especially in the earlier book scans, and some of the data (such as publication year) may be inaccurate. See Village pump notice.
Introduction
For book uploads, the Internet Archive is the best source to get an original, preferably in DjVu format, which is then handy for Wikisource. The benefit of the Flickrstream run by the Internet Archive is that a specialized routine has been used to detect images (photographs, cartoons, diagrams, illuminations, manuscript epistles etc.) while avoiding the text in books, and to crop these from the individual book page. This adds significant value for Commons, as these images are often exactly what is needed to illustrate Wikipedia articles, or for more general reuse where the multi-page book itself, or even a whole page from it, is a less relevant format.
This project targets the larger images within the Flickrstream (latterly a minimum of 2,800 pixels on the longest side), hopefully getting high resolution entire book plates or large figures that illustrate a book. Anyone wishing to check through the rest of the book can follow the links given in the image page text, and an easy method is provided for "uncropping" the figure so that it is replaced with the entire page.
Project implementation
The Internet Archive Flickrstream has over five million images cropped from book pages in their collection. Books on the Internet Archive website are available in various readable formats, but it is not possible to reference an individual image within a book, though it is possible to link to (and download) individual pages. The Flickr site is a challenge for Commons as the automated cropping is experimental and may give odd results, such as cropping a chart so that an image has just the chart lines but misses the title, keys or annotations. By restricting the selection to large images, the aim is to find book plates, mainly of a whole page size. After the first test run, images are filtered to be a minimum of 2,800 pixels on the longest side and at least 1,200 pixels on the shortest.
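The size filter described above can be sketched as follows. This is a minimal illustration rather than the project's actual code; the function name `is_large_enough` is hypothetical, and only the two pixel limits come from the text.

```python
# Limits taken from the text; everything else here is illustrative.
MIN_LONGEST = 2800   # minimum pixels on the longest side
MIN_SHORTEST = 1200  # minimum pixels on the shortest side


def is_large_enough(width, height):
    """Return True if an image of this size passes the project's size filter."""
    longest, shortest = max(width, height), min(width, height)
    return longest >= MIN_LONGEST and shortest >= MIN_SHORTEST
```

Orientation does not matter in this sketch, so a 1,200 × 2,800 portrait plate passes just as a 2,800 × 1,200 landscape plate does.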
Un-cropping
Faebot monitors Category:Internet Archive (uncrop needed) so that images taken from Flickr can be automatically "uncropped" and replaced by their equivalent entire book page from Internet Archive. Doing this "manually" is possible by following the link given on the image page and then re-uploading. However, to get the full-size image using the Internet Archive book page display, it is necessary to zoom in to the image around five times (it varies). As a work-flow this is time consuming, and there is a high chance that a user would upload a lower resolution version than the maximum possible, or even the wrong page.
The back-link to the Internet archive book page includes the "physical" page number in the form "/n<page>".
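As a small illustration, the physical page number could be read back out of such a link with a regular expression. This is only a sketch: the function name `physical_page` and the example URL are hypothetical, and the only detail taken from the text is the "/n<page>" form.

```python
import re


def physical_page(back_link):
    """Return the number from a '/n<page>' path segment, or None if absent."""
    match = re.search(r"/n(\d+)(?=/|$)", back_link)
    return int(match.group(1)) if match else None


# Hypothetical back-link, used only to show the call.
page = physical_page("https://archive.org/stream/examplebook#page/n42/mode/1up")
```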
Batch auto-cropping
A batch crop facility for "obvious" crop cases with significant blank margin areas has been created. This uses on-demand checks by Faebot (i.e. manually started rather than on a schedule) for images in Category:Autocrop by Faebot. This then runs the following procedure:
- Check the page border for a maximum of 50 pixels of "noisy" changes in standard deviation and crop towards the image center until the maximum deviation is lower than 8 (out of 255). This removes accidental edges showing the scanner background, or rough or damaged page edges and binding areas.
- Going by one pixel width at a time, crop border areas with standard deviation less than 8, heading towards the center of the image.
- Add to Category:Autocrop by Faebot (check) for human review or additional manual cropping with the standard CropTool.
Cases where this crop method would give poor results include small images confined to one half of an otherwise blank page, and pages with odd visible marks, stamps or text such as page numbers. Colour images of old documents are less likely to give good results, as foxing, visible damage or fuzzy image boundaries will confuse the automated process.
Results are satisfactory, requiring no further changes, in about 80% to 95% of cases, depending on the type of images selected for auto-cropping. Past results are at Category:Images automatically enhanced by Faebot.
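The cropping procedure above might look roughly like the sketch below. It is one reading of the steps, not Faebot's code, and it makes several assumptions: the image is a greyscale list of rows, each row or column is measured as a whole line, a noisy edge is only trimmed when a quiet line is found behind it, and sides are handled in the order top, bottom, left, right. The names `trim_count` and `autocrop_box` are hypothetical.

```python
from statistics import pstdev

QUIET = 8       # standard deviation limit from the text (0-255 scale)
MAX_NOISY = 50  # maximum width of a noisy edge, from the text


def trim_count(lines):
    """Count the lines to trim from one side; `lines` runs from the edge inwards."""
    stds = [pstdev(line) for line in lines]
    # Step 1: skip up to MAX_NOISY noisy lines (scanner background, rough edges).
    noisy = 0
    while noisy < len(stds) and noisy < MAX_NOISY and stds[noisy] >= QUIET:
        noisy += 1
    if noisy == len(stds) or stds[noisy] >= QUIET:
        return 0  # assumption: no quiet margin behind the noise, so leave this side
    # Step 2: remove the quiet margin one line at a time.
    cut = noisy
    while cut < len(stds) and stds[cut] < QUIET:
        cut += 1
    return cut


def autocrop_box(pixels):
    """Return a (left, top, right, bottom) crop box for a greyscale image."""
    height, width = len(pixels), len(pixels[0])
    full = (0, 0, width, height)
    top = trim_count(pixels)
    bottom = height - trim_count(pixels[top:][::-1])
    if top >= bottom:
        return full  # nothing but margin: leave the file for human review
    # Columns are measured inside the rows that survived the vertical crop.
    columns = [list(col) for col in zip(*pixels[top:bottom])]
    left = trim_count(columns)
    right = width - trim_count(columns[left:][::-1])
    return full if left >= right else (left, top, right, bottom)
```

Measuring columns only after the rows are cropped keeps a noisy top or bottom edge from making every margin column look busy, which is a design choice of this sketch rather than something the text specifies.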
Missing volume numbers
A housekeeping routine scans for missing volume numbers and replaces the date where there is a conflict. Periodicals and report series on Internet Archive often have a volume field in the metadata, but this has not been carried over to the Flickrstream, and the default for the archive is to name the "book" (which may be the entire journal run) by the earliest date in the journal series.
After some testing, this was integrated as a check on future uploads before writing to Commons.
Blank pages
A housekeeping task scans for blank pages. This runs locally in Python on a volunteer desktop at ~1,500 images/hour. It checks whether the main body of an image (the image with 7% cropped off as a border) has a maximum standard deviation of less than 8. Each colour channel (RGB) is checked over its range of 0-255, giving 3 standard deviation figures, of which the largest is tested. The maximum of 8 was found by experimentation with sample images, rather than from any theory. To save on bandwidth and processing time, thumbnails 300 pixels wide are used rather than the full-sized image. The images are cropped so that borders that include book edges or discoloured page edges do not mislead detection. The suggested blanks are added to Category:Internet Archive (blank pages) for volunteer review; a file that is removed from the category rather than deleted will not be re-added.
After a successful run as a housekeeping post-upload routine, this test has been added to the upload process so that detected blanks are skipped before upload is attempted. The pre-upload routine uses 'medium' sized files from Flickr, rather than full sized. Black & white detected images are filtered with a 15% crop and a deviation below 11, due to the generally poorer quality of the black and white scans.
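A minimal sketch of this blank test is given below, using only the standard library. It is not the project's script: the image is assumed to be a list of rows of channel tuples, the border percentage is assumed to be cropped from every side, and the names `max_channel_std` and `looks_blank` are hypothetical. The limits (7% and 8 for colour, 15% and 11 for black and white) are the ones given in the text.

```python
from statistics import pstdev


def max_channel_std(pixels, margin):
    """Largest per-channel standard deviation of the image centre.

    `pixels` is a list of rows of channel tuples; `margin` is the fraction
    cropped off each side before measuring (an assumed reading of the text).
    """
    height, width = len(pixels), len(pixels[0])
    dy, dx = int(height * margin), int(width * margin)
    centre = [px for row in pixels[dy:height - dy] for px in row[dx:width - dx]]
    # zip(*centre) regroups the pixels into one sequence per colour channel.
    return max(pstdev(channel) for channel in zip(*centre))


def looks_blank(pixels):
    """Apply the colour limits, or the looser black and white limits."""
    black_and_white = len(pixels[0][0]) == 1
    margin, limit = (0.15, 11) if black_and_white else (0.07, 8)
    return max_channel_std(pixels, margin) < limit
```

In this sketch a dark book edge a few pixels wide falls inside the cropped border and so does not stop a page from being reported as blank.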
Example detected 'blanks', with the central area to be tested approximately highlighted in red:
- Bot-detected as blank; the left hand edge reflections and bottom label are outside the central area used to calculate standard deviation.
- Detected as blank, with the darker book edge and accidental black scanner background cropped out, and the visible marks from the other side of the paper insufficient to go over the standard deviation limit.
- Detected as black and white image, so a broader 15% margin applied along with a higher minimum standard deviation, making it more likely to be rejected as blank.
Mostly blank pages
Category:Internet Archive (mostly blank) has been populated with images where a large number of small sub-regions of the image have been detected as being blank, in this instance having a standard deviation of less than 5 given a pixel colour range of 255.
The test consists of breaking a 400 pixel wide version of the image down into a chess-board of 64 rectangles (50 pixels wide) and testing each rectangle for its maximum standard deviation. Files already categorized as blank pages, or previously removed from the mostly blank category are skipped from the test. Results include blank pages with noisy areas or wide borders, images that would benefit from a crop and unusual images with a lot of featureless single-colour regions; all are likely to be worth volunteer review to confirm they are within Commons:Project scope.
The metrics set for the categorization are:
- Standard deviation of a sub-zone must be less than 5 to be considered blank.
- The total number of blank sub-zones must be 32/64 or more.
- The number of central blank sub-zones must be at least 24/36 or at least 18/36 if more than half the 28 border zones are blank.
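The metrics above can be expressed as a short decision function. This is a sketch under stated assumptions rather than the bot's code: the image is taken to be greyscale (so each zone has a single standard deviation), the grid is 8 × 8 with the inner 6 × 6 zones counted as central, and the names `zone_grid` and `is_mostly_blank` are hypothetical.

```python
from statistics import pstdev

GRID = 8        # 8 x 8 chess-board, giving the 64 sub-zones from the text
ZONE_LIMIT = 5  # a sub-zone with a standard deviation below this is blank


def zone_grid(pixels):
    """Standard deviation of each sub-zone of a greyscale image (list of rows)."""
    height, width = len(pixels), len(pixels[0])
    grid = []
    for gy in range(GRID):
        rows = pixels[gy * height // GRID:(gy + 1) * height // GRID]
        grid.append([
            pstdev([v for row in rows
                    for v in row[gx * width // GRID:(gx + 1) * width // GRID]])
            for gx in range(GRID)
        ])
    return grid


def is_mostly_blank(zone_stds):
    """Apply the three metrics to an 8 x 8 grid of sub-zone deviations."""
    blank = [[std < ZONE_LIMIT for std in row] for row in zone_stds]
    total = sum(map(sum, blank))
    central = sum(blank[y][x]
                  for y in range(1, GRID - 1) for x in range(1, GRID - 1))
    border = total - central  # blank zones among the 28 border zones
    if total < 32:
        return False
    # 24/36 central zones, or 18/36 when more than half the border is blank.
    return central >= 24 or (central >= 18 and border > 14)
```

For example, a page whose top four rows of zones are blank has 18 central and 14 border blanks, which falls just short of the relaxed rule; one more blank border zone tips it over.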
Identification of blank pages and "accidental" images such as the microfiche start arrow using this method is highly successful, with around 95% of images found being suitable for deletion as blanks. Many of the 'false' matches benefit from review or cropping/uncropping.
- Example false matches to the "mostly blank" test
Black and white
Black & white detection was created, with files being added to Category:Files from Internet Archive Book Images Flickr stream (black & white). This was later built into the upload process. Detection uses the medium-sized Flickr image, which is pulled into memory to test whether the proposed image is blank, and at the same time works out how many colour channels it uses. If there is 1 channel, the image is black and white. The Internet Archive has some digitized microfiche film and photocopies in black and white; these are generally of poor quality with overly high contrast, or include text pages mistaken for images. For these reasons they are worth automatically isolating.
This is different from colour photographs or colour scans of black and white pages: these count as colour images, as the result will not be "pure" black and white.
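The channel test can be illustrated in a couple of lines. This is a sketch only: the function name `is_black_and_white` is hypothetical, and the input is assumed to be a tuple of channel names of the kind Pillow's `Image.getbands()` returns, though the text does not say which library is used.

```python
def is_black_and_white(bands):
    """Return True for a single-channel image.

    `bands` is a tuple of channel names, for example ("L",) for a greyscale
    image or ("R", "G", "B") for a colour one.
    """
    return len(bands) == 1
```

A colour scan of a black and white page still has three channels, so under this test it is treated as a colour image.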
Useful reference deletion discussions
These are deletion discussions where the copyright status of a publication has been in doubt and reviewed. These are worth checking before raising deletion requests for related material.
- Bulletin of the British Ornithologists' Club
- Commons:Deletion requests/File:Bulletin of the British Ornithologists' Club (20445944685).jpg
- Cranberries; - the national cranberry magazine
- Commons:Deletion requests/File:Cranberries; - the national cranberry magazine (1982) (20520589258).jpg
- Ladies' home journal
- Commons:Deletion requests/File:The Ladies' home journal (1889) (14580245160).jpg
- Die Cephalopoden, I. Teil
- Commons:Deletion requests/Files in Category:Die Cephalopoden, I. Teil
- Publication was in 1921, but the artist lived until 1949. The images are considered public domain in the U.S.; however, as publication was in Germany, the rule of life+70 means they will be public domain in Germany in 2020.
Reports
Improvement suggestions
Ten randomly selected files with a single mainspace use on Wikimedia projects:
Up to ten randomly selected files with the lowest category counts in the project:
Wikimedia usage
Usage counts per file, in descending order: 1177, 145, 128, 89, 89, 84, 79, 79, 77, 73, 69, 64, 57, 45, 36, 35, 34, 33, 28, 28, 28, 26.
Most edited
Edit counts per file, in descending order: 45, 44, 43 (6 files), 42 (16 files).