This is an efficient implementation of functions useful for near-duplicate detection based on Charikar's simhash. It is a Python module, written in C with GCC extensions, and includes the following functions:
Generate hashes:
>>> from simhash import fingerprint
>>> hash1 = fingerprint(map(hash, "some text we want to hash"))
>>> hash2 = fingerprint(map(hash, "some more text we want to hash"))
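For readers unfamiliar with Charikar's simhash, the idea behind a fingerprint can be sketched in pure Python. This is an illustrative sketch only, not the module's C code: the name `simhash_fingerprint`, the 64-bit default width, and the rule that ties resolve to a 0 bit are all assumptions made for this example.

```python
def simhash_fingerprint(feature_hashes, bits=64):
    """Illustrative simhash: one signed vote per bit position per feature hash.

    Hypothetical helper for explanation; the real module may differ in
    bit width, tie-breaking, and input handling.
    """
    mask = (1 << bits) - 1
    counts = [0] * bits
    for h in feature_hashes:
        h &= mask  # keep only the low `bits` bits (also handles negative ints)
        for i in range(bits):
            # A set bit votes +1 for position i, a clear bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # majority of features had this bit set
            fingerprint |= 1 << i
    return fingerprint


# Three 4-bit feature hashes; bits 1 and 3 win the majority vote.
example = simhash_fingerprint([0b1010, 0b1010, 0b0110], bits=4)
```

Because each output bit is a majority vote, inputs that share most of their features tend to produce fingerprints that differ in only a few bits.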
Measure distance between hashes:
>>> from simhash import hamming_distance
>>> hamming_distance(hash1, hash2)
2L
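The Hamming distance between two fingerprints is the number of bit positions in which they differ. A minimal pure-Python sketch follows, assuming 64-bit fingerprints; `hamming` and `is_near_duplicate` are hypothetical names for this example, and the threshold of 3 is an arbitrary illustration, not a value taken from this project.

```python
def hamming(a, b, bits=64):
    """Count the bit positions where two fingerprints differ."""
    mask = (1 << bits) - 1
    # XOR leaves a 1 wherever the inputs disagree; count those 1s.
    return bin((a ^ b) & mask).count("1")


def is_near_duplicate(a, b, max_distance=3):
    """Treat fingerprints within `max_distance` bits as near duplicates.

    The default threshold is an assumption for illustration only.
    """
    return hamming(a, b) <= max_distance


distance = hamming(0b1010, 0b0110)  # the inputs differ in bits 2 and 3
```

A small distance suggests the two inputs share most of their features; the right threshold depends on the dataset and the fingerprint width.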
This code was used in MapReduce jobs run against a large dataset of web pages as part of a prototype at Scrapinghub.