Using Python efficiently to calculate Hamming distances
Question
I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) I can use. How do I do this efficiently? My pseudocode would be:
For each string:
    currentstring = string
    For each string other than currentstring:
        Calculate Hamming distance
I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!
Any pointers gratefully received.

Here is what I have tried, but it is slow:
import glob
import pHash

path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)

i = 0
j = 0
for fname in files:
    print 'fname', fname, 'setOfFiles', len(setOfFiles)
    oneLessSetOfFiles = setOfFiles
    oneLessSetOfFiles.remove(fname)
    i += 1
    for compareFile in oneLessSetOfFiles:
        j += 1
        hash1 = pHash.imagehash(fname)
        hash2 = pHash.imagehash(compareFile)
        print ...
Answer

The distance package in Python provides a Hamming distance calculator:
import distance
distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")
There is also a Levenshtein package which provides Levenshtein distance calculations. Finally, difflib can provide some simple string comparisons.
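One caveat worth noting: hashes like 50358c591cef4d76 are hex-encoded 64-bit values, and distance.hamming compares them character by character, not bit by bit. For pHash-style hashes the usual metric is the bit-level Hamming distance, which you can get without any package by XOR-ing the integers and counting set bits. A minimal sketch:

```python
def bit_hamming(hex_a, hex_b):
    """Bit-level Hamming distance between two hex-encoded hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count('1')

# These two hashes differ only in the lowest bit.
print(bit_hamming("50358c591cef4d76", "50358c591cef4d77"))  # 1
```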
Your existing code is slow because you recalculate the file hash in the innermost loop, which means every file gets hashed many times. If you calculate the hashes first, the process will be much more efficient:
files = ...
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
    (hamming(first[1], second[1]), first, second)
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]
]
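As a runnable sketch of the same idea (pHash isn't available here, so the hashes below are made-up integers standing in for pHash.imagehash output, and itertools.combinations is used so each unordered pair is compared exactly once):

```python
import itertools

def bit_hamming(a, b):
    """Hamming distance between two integer hashes."""
    return bin(a ^ b).count('1')

# Stand-ins for pHash.imagehash(filename) results; filenames are hypothetical.
hashes = {"img1.png": 0x50358c591cef4d76,
          "img2.png": 0x50358c591cef4d77,
          "img3.png": 0xffffffffffffffff}

# Hash each file once, then compare every unordered pair exactly once.
results = {
    (f1, f2): bit_hamming(h1, h2)
    for (f1, h1), (f2, h2) in itertools.combinations(sorted(hashes.items()), 2)
}
print(results[("img1.png", "img2.png")])  # 1
```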
This process fundamentally involves O(N^2) comparisons, so to distribute it in a way suitable for a map-reduce problem you take the complete set of strings and divide them into B blocks, where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:
all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
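The same block assignment can be generated programmatically for any number of blocks; a small sketch (block_pairs is a hypothetical helper, not part of any library):

```python
def block_pairs(strings, num_blocks):
    """Split strings into num_blocks blocks and yield every (block_i, block_j)
    pair, producing num_blocks**2 tasks, one per worker."""
    size = (len(strings) + num_blocks - 1) // num_blocks
    blocks = [strings[i * size:(i + 1) * size] for i in range(num_blocks)]
    for a in blocks:
        for b in blocks:
            yield a, b

# 16 strings, 2 blocks -> 4 tasks of 8-vs-8 comparisons, one per worker.
tasks = list(block_pairs([format(i, 'x') for i in range(16)], 2))
print(len(tasks))  # 4
```

Each task can then be shipped to a worker (e.g. as one unit of a Hadoop Streaming job) that runs the all-pairs comparison on its two blocks.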