MinHash LSH
MinHash LSH (Locality Sensitive Hashing) is a techinque to do near duplicate detection in a fast way. It is a type of LSH where the hasing part is done by MinHash.
If we have a LSH function we can check the hash of two documents and if it doesn't match the documents are different and if it matches the documents might be same. This gives a filter that gives us either NOT SAME or MAYBE SAME. Now similar to Bloom Filter, to make this even better we use mutiple hash functions. A typical LSH algorithm works this way:
- We have a document and we hash the document using multiple locality sensitive hash functions (say we have 100 hash functions).
- Then bucketing/banding of hashes is done the hash values. i.e. hashes are grouped into groups of 5.
- Each group is hashed. So, if we have 20 groups (i.e. 20 bands/buckets) then we have 20 chances to detect similarity.
Overall, MinHash LSH for text document involves the following steps:
- Shingling (Set Representation) - We break documents into sets of words or characters or n-grams. This gives a sparse vector representation for each document.
- Signature Generation - We apply MinHash to the spare vector to get the signatures.
- LSH Banding (Bucketing) - We divide the signatures into a \(b\) bands of \(r\) rows. Documents that share an identical signature within a single band are hashed into the same "bucket" and identified as candidate pairs
See this article for visualization and explanation of the MinHash algorithm and overall MinHash LSH procedure.