2 Star 0 Fork 0

mirrors_scrapinghub/python-simhash

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
BSD-3-Clause

simhash

This is an efficient implementation of some functions that are useful for implementing near duplicate detection based on Charikar's simhash. It is a python module, written in C with GCC extentions, and includes the following functions:

fingerprint
Generates a fingerprint from a sequence of hashes
weighted_fingerprint
Generate a fingerprint from a sequence of (long long, weight) tuples
fnvhash
Generates a (FNV-1a) hash from a string
hamming_distance
Calculate the number of bits that differ between 2 long long integers
simpair_indices
Find the indices of hashes in a sequence that differ by less than a certain number of bits. It includes arguments for rotating and grouping hashes. It can be used to help efficiently implement online or batch near duplicate detection, for example as described in Detecting Near-Duplicates for Web Crawling by Gurmeet Manku, Arvind Jain, and Anish Sarma.

Example usage

Generate hashes:

>>> from simhash import fingerprint
>>> hash1 = fingerprint(map(hash, "some text we want to hash"))
>>> hash2 = fingerprint(map(hash, "some more text we want to hash"))

Measure distance between hashes:

>>> from simhash import hamming_distance
>>> hamming_distance(hash1, hash2)
2L

This code was used from mapreduce jobs against a large dataset of webpages as part of a prototype at Scrapinghub.

Copyright (c) Scrapinghub All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of Scrapinghub nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

简介

暂无描述 展开 收起
README
BSD-3-Clause
取消

发行版

暂无发行版

贡献者

全部

语言

近期动态

不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/mirrors_scrapinghub/python-simhash.git
git@gitee.com:mirrors_scrapinghub/python-simhash.git
mirrors_scrapinghub
python-simhash
python-simhash
master

搜索帮助