User:Multichill/Using OpenCV to categorize files
At the time of writing, Commons contains about 150,000 uncategorized files. This is only about 1.25% of all files, but it's always nice to be able to lower the number even further. The CategorizationBot has already done a lot of categorization work, but all of that work is based on how a file is used. No categorization has been done based on the contents of the file itself.
OpenCV (Open Source Computer Vision) is a library of programming functions for real-time computer vision. It can be used to "recognize" images, so it could be used to move uncategorized files to one of the unidentified topics categories based on the image characteristics. OpenCV contains several approaches we could use to "recognize" images:
- Cascade classification
- Machine learning
- Object Categorization
- Normal Bayes classifier: Using the Normal Bayes classifier for image categorization in OpenCV
- Bag of Words model: The Bag of Words model in OpenCV 2.2
- bagofwords_classification.cpp can be imported as python module with help of Boost.Python
- Sample dataset for training
- Object Categorization
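To make the Bag of Words idea a bit more concrete, here is a minimal pure-Python sketch of its central step: every local feature descriptor of an image is assigned to the nearest "visual word", and the image is then described by a histogram of word counts. This is not the OpenCV implementation. The function names and the tiny two-dimensional descriptors are made up for illustration; real descriptors are much longer, and the vocabulary would normally come from clustering the descriptors of many training images.

```python
def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor.

    Closeness is squared Euclidean distance.
    """
    def dist2(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))

    return min(range(len(vocabulary)), key=lambda i: dist2(vocabulary[i]))


def bow_histogram(descriptors, vocabulary):
    """Return the normalized visual-word histogram for one image."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # An image without any descriptors gets an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


# Hypothetical toy data: two visual words, four descriptors from one image.
vocabulary = [(0, 0), (10, 10)]
descriptors = [(1, 1), (9, 9), (8, 8), (10, 12)]
histogram = bow_histogram(descriptors, vocabulary)
```

The resulting histogram has a fixed length regardless of image size, which is what allows it to be fed to a classifier such as the Normal Bayes classifier.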
Some frequently occurring subjects in uncategorized files:
- People, could go to Category:Unidentified people
- Maps, could go to Category:Unidentified maps
- Flags, could go to Category:Unidentified flags
- Plants, could go to Category:Unidentified plants
- Coats of arms, could go to Category:Unidentified coats of arms
- Buildings, could go to Category:Unidentified buildings
- Trains, could go to Category:Unidentified trains
- Automobiles, could go to Category:Unidentified automobiles
- Buses, could go to Category:Unidentified buses
- Diagrams
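The list above is essentially a lookup table from a recognized subject to a target category. A small sketch of how a bot could hold it, assuming the classifier returns one of the plain-text labels used as keys here (the labels and the function name are my own invention, only the category names come from the list):

```python
# Subject labels (hypothetical) mapped to the categories listed above.
# Diagrams have no target category on this page yet, so they map to None.
UNIDENTIFIED_CATEGORIES = {
    "people": "Category:Unidentified people",
    "maps": "Category:Unidentified maps",
    "flags": "Category:Unidentified flags",
    "plants": "Category:Unidentified plants",
    "coats of arms": "Category:Unidentified coats of arms",
    "buildings": "Category:Unidentified buildings",
    "trains": "Category:Unidentified trains",
    "automobiles": "Category:Unidentified automobiles",
    "buses": "Category:Unidentified buses",
    "diagrams": None,
}


def category_wikitext(label):
    """Return the wikitext category link for a label, or None if there is none."""
    category = UNIDENTIFIED_CATEGORIES.get(label)
    if category is None:
        return None
    return "[[%s]]" % category
```

Returning None for unknown labels keeps the bot from adding a category when the classifier output is something it has no target for.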
I installed OpenCV as explained here:
- I already had Python 2.7 installed
- Installed the Python eggs of NumPy and SciPy
- Downloaded and installed the (rather large) Windows package
- Copied the contents of "C:\opencv\build\python\2.7" to "C:\Python27\Lib\site-packages"
In the C:\opencv\samples\ directory there are two folders with example Python programs. Fun and useful to play around with!
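A quick way to check that the copied bindings are actually visible to Python is to ask the interpreter whether it can locate the modules. The sketch below assumes a current Python 3 interpreter rather than the Python 2.7 used above (importlib.util.find_spec is not available in 2.7), and the helper name is made up:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["numpy", "scipy", "cv2"])
    if missing:
        print("Not importable: " + ", ".join(missing))
    else:
        import cv2
        print("OpenCV version: " + cv2.__version__)
```

If cv2 shows up as not importable, the files most likely ended up in the wrong site-packages directory.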
The first test is to use a ready-made classifier to do face detection, in combination with Pywikipedia, to fill Category:Unidentified people (bot tagged). The first results look promising: I see a lot of faces, but also some false positives. The next step is probably to start training some filters based on Commons images.
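Roughly, the face detection step could look like the sketch below. It is not the actual bot code. It assumes the cv2 Python bindings and a Haar cascade file such as haarcascade_frontalface_default.xml, which ships with OpenCV but whose location depends on the installation. The size filter in looks_like_people is my own guess at a simple way to cut down on false positives, and its threshold is arbitrary. The Pywikipedia part (fetching files and saving the category) is left out.

```python
def detect_faces(image_path, cascade_path):
    """Return a list of (x, y, w, h) rectangles for the faces found in an image."""
    # Imported here so the helper below can be used without OpenCV installed.
    import cv2

    classifier = cv2.CascadeClassifier(cascade_path)
    if classifier.empty():
        raise IOError("could not load cascade: %s" % cascade_path)

    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read image: %s" % image_path)

    # Cascades work on grayscale input; equalizing helps with poor contrast.
    gray = cv2.equalizeHist(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))
    rects = classifier.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(value) for value in rect) for rect in rects]


def looks_like_people(rects, image_width, min_fraction=0.05):
    """Return True if at least one face is at least min_fraction of the image width.

    Assumption: very small detections are more often false positives.
    """
    return any(w >= min_fraction * image_width for (_x, _y, w, _h) in rects)
```

A file would then only be tagged when looks_like_people returns True for the rectangles that detect_faces found in it. Raising minNeighbors makes the detector stricter, at the cost of missing some real faces.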
See also User:DrTrigonBot, which has similar Python code.