Today's blog is about image processing. In some downtime, I've been trying to figure out how to manage my photo library, which has grown beyond 25 GB. That makes it hard to just upload to the cloud (the free services, anyway). My first intention was to remove duplicates, but of course it's more fun to tackle the bigger issue: how can I index my library so I can actually query against it?
My first thought was: there must be a simple way to do some basic image comparisons. So what are they?
1) produce a grayscale histogram
2) produce a color histogram
3) create a CRC
4) gather metadata
If two images are exactly the same, regardless of file name and other attributes that change when a file is copied, then their CRCs (e.g., Adler-32) will match. The trick is: don't calculate the CRC on the file, but rather on the contained image, because the file bytes change whenever the metadata does, and if the image is edited and re-saved (i.e. a copy-paste), the EXIF metadata can get lost anyway. So, trick one: calculate the CRC on the pixel map, and at the same time rip out the EXIF metadata to stuff into an XML metadata file. (Cool, right? Too bad Java's standard libraries do not include a mechanism to extract EXIF from JPEG! But it's easy enough to borrow some libraries from open source projects.)
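A minimal sketch of the pixel-map checksum idea, using `java.util.zip.Adler32` from the standard library (the class and method names here are my own invention for illustration):

```java
import java.awt.image.BufferedImage;
import java.util.zip.Adler32;

public class PixelChecksum {
    // Checksum the decoded pixels rather than the file bytes, so two copies
    // that differ only in metadata (EXIF stripped, renamed, re-saved
    // losslessly) still produce the same value.
    static long pixelCrc(BufferedImage img) {
        Adler32 crc = new Adler32();
        byte[] px = new byte[3];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y); // packed ARGB, independent of on-disk storage format
                px[0] = (byte) (rgb >>> 16); // R
                px[1] = (byte) (rgb >>> 8);  // G
                px[2] = (byte) rgb;          // B
                crc.update(px, 0, 3);
            }
        }
        return crc.getValue();
    }
}
```

In practice you'd feed it `ImageIO.read(file)`; reading through `getRGB` keeps the checksum stable across different internal raster layouts.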
The CRC will tell you whether two images are exactly the same, but you still have the issue of "What if I rotated the image 90 or 180 degrees?" You will get a different CRC, so that mechanism won't help. But you can create a histogram of the image. A histogram only counts pixel values, so it is unchanged by edge-aligned rotation (90°, 180°, 270°) or by flips; if two histograms are equal, the images are the same with high probability. Images of the same subject in the same context (background) will also have similar histograms.
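A sketch of the histogram step, here as a 256-bin grayscale histogram (class name and the BT.601 luma weights are my choices, not from the original post):

```java
import java.awt.image.BufferedImage;

public class Histogram {
    // 256-bin grayscale histogram. Rotating by 90/180/270 degrees or flipping
    // only reorders pixels, so the histogram is identical for those variants.
    static int[] grayHistogram(BufferedImage img) {
        int[] bins = new int[256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >>> 16) & 0xFF;
                int g = (rgb >>> 8) & 0xFF;
                int b = rgb & 0xFF;
                // ITU-R BT.601 luma weighting for the grayscale conversion
                int gray = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                bins[gray]++;
            }
        }
        return bins;
    }
}
```

The color-histogram variant is the same loop with separate bin arrays per channel.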
So, if you have a free rotate or a scale (or the image quality was otherwise changed, or it was cropped), the histogram might be a good starting guess. There will be false positives, but it cheaply narrows the field before you move on to more intensive methods. How to compare? Cosine similarity works. The vector is the histogram, and you can fix the scaling issue by converting each color's frequency (count) to a percentage (count over total pixel count). I have a sneaking suspicion this is how the brain does it: when our eyes scan an image, we look for discontinuity and tend to ignore anything with continuity. (That's why "if it were any closer it would bite me.")
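The comparison step can be sketched as cosine similarity over count-normalized histograms, per the scaling fix above (the normalization is technically redundant for cosine similarity, which already ignores uniform scaling, but it keeps the vectors comparable for other distance metrics too):

```java
public class CosineSimilarity {
    // Cosine similarity between two histograms, with each bin converted from
    // a raw count to a fraction of total pixels so image size drops out.
    static double cosine(int[] h1, int[] h2) {
        double n1 = 0, n2 = 0;
        for (int v : h1) n1 += v;
        for (int v : h2) n2 += v;
        double dot = 0, m1 = 0, m2 = 0;
        for (int i = 0; i < h1.length; i++) {
            double a = h1[i] / n1;
            double b = h2[i] / n2;
            dot += a * b;
            m1 += a * a;
            m2 += b * b;
        }
        return dot / (Math.sqrt(m1) * Math.sqrt(m2));
    }
}
```

A result near 1.0 flags a candidate pair worth a closer (more expensive) look.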
I am still thinking about false positives, and about how to identify subjects in an image. For subjects, cosine similarity will also work, but figuring out the threshold for what to suggest to a human could be fun. I need to play with that, but I figure a divide and conquer based on histograms might work, along with some extrapolation methods. I could divide and conquer to generate average colors over sections of an image, and/or use those to fit a polynomial to horizontal and vertical slices. If the coefficients are close, then there is a good bet the images are similar.
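A rough sketch of the divide-and-conquer idea: split the image into an n-by-n grid and record the average RGB of each cell, giving a coarse spatial signature to compare (again, the class and method names are hypothetical, and the polynomial-fitting step is left out):

```java
import java.awt.image.BufferedImage;

public class GridSignature {
    // Average RGB per cell of an n-by-n grid. Unlike a whole-image histogram,
    // this keeps coarse spatial layout, so it can separate images that merely
    // share a color distribution.
    static double[] averageGrid(BufferedImage img, int n) {
        double[] sig = new double[n * n * 3]; // R, G, B sums per cell, averaged below
        int[] counts = new int[n * n];
        for (int y = 0; y < img.getHeight(); y++) {
            int gy = y * n / img.getHeight();
            for (int x = 0; x < img.getWidth(); x++) {
                int gx = x * n / img.getWidth();
                int cell = gy * n + gx;
                int rgb = img.getRGB(x, y);
                sig[cell * 3]     += (rgb >>> 16) & 0xFF;
                sig[cell * 3 + 1] += (rgb >>> 8) & 0xFF;
                sig[cell * 3 + 2] += rgb & 0xFF;
                counts[cell]++;
            }
        }
        for (int c = 0; c < counts.length; c++)
            for (int ch = 0; ch < 3; ch++)
                sig[c * 3 + ch] /= counts[c];
        return sig;
    }
}
```

Two signatures can then be compared with the same cosine similarity, so the whole pipeline reuses one distance measure at increasing levels of detail.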