MinHash | Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development

What is a near-dupe, really?

When you try to quantify how similar near-duplicates are, several subtleties arise.  This article looks at three reasonable, but different, ways of defining the near-dupe similarity between two documents.  It also explains the popular MinHash algorithm and shows that its results may surprise you in some circumstances.

Near-duplicates are documents that are nearly, but not exactly, the same.  They could be different revisions of a memo where a few typos were fixed or a few sentences were added.  They could be an original email and...

Linked on 2014-10-16 06:41:10