MinHash LSH

MinHash LSH (Locality Sensitive Hashing) is a techinque to do near duplicate detection in a fast way. It is a type of LSH where the hasing part is done by MinHash.

If we have a LSH function we can check the hash of two documents and if it doesn't match the documents are different and if it matches the documents might be same. This gives a filter that gives us either NOT SAME or MAYBE SAME. Now similar to Bloom Filter, to make this even better we use mutiple hash functions. A typical LSH algorithm works this way:

We have a document and we hash the document using multiple locality sensitive hash functions (say we have 100 hash functions).
Then bucketing/banding of hashes is done the hash values. i.e. hashes are grouped into groups of 5.
Each group is hashed. So, if we have 20 groups (i.e. 20 bands/buckets) then we have 20 chances to detect similarity.

Overall, MinHash LSH for text document involves the following steps:

Shingling (Set Representation) - We break documents into sets of words or characters or n-grams. This gives a sparse vector representation for each document.
Signature Generation - We apply MinHash to the spare vector to get the signatures.
LSH Banding (Bucketing) - We divide the signatures into a \(b\) bands of \(r\) rows. Documents that share an identical signature within a single band are hashed into the same "bucket" and identified as candidate pairs

See this article for visualization and explanation of the MinHash algorithm and overall MinHash LSH procedure.

Backlinks

LLM Pre-Training