That Looks Oddly Familiar
Perceptual hashing is a fascinating technique of summarising media files. It has little in common with cryptographic hashes such as SHA1. Two input files which look similar will end up having different cryptographic yet similar perceptual hashes. And by similar, we mean having most bits set the same way.
In this talk we'll combine pHash and a BK-tree to efficiently search through metric spaces of perceptual hashes. We will use Ruby to implement a simple command line tool. It will scan our photo library, hash all the pictures, and look for similarities. By the end of the talk, we'll have a complete list of similar and near-duplicate images needlessly occupying space on our hard drive.