Using Python efficiently to calculate Hamming distances


Question


I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) I can use. How do I do this efficiently? My pseudocode would be:

For each string
    currentstring= string
    For each string other than currentstring
        Calculate Hamming distance


I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!

Any pointers gratefully received.


Here is what I have tried, but it is slow:

import glob
import pHash  # pHash Python bindings supply imagehash()

path = lotsdir + '*.*'  # lotsdir is the directory holding the images
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print(len(setOfFiles))
i = 0
j = 0
for fname in files:
    print('fname', fname, 'setOfFiles', len(setOfFiles))
    # Note: this aliases setOfFiles rather than copying it,
    # so remove() shrinks the original set on every iteration.
    oneLessSetOfFiles = setOfFiles
    oneLessSetOfFiles.remove(fname)
    i += 1

    for compareFile in oneLessSetOfFiles:
        j += 1
        # Both hashes are recomputed on every inner iteration.
        hash1 = pHash.imagehash(fname)
        hash2 = pHash.imagehash(compareFile)
        print(...)

Answer


The distance package in Python provides a Hamming distance calculator:

import distance  # third-party package: pip install distance

distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")  # number of positions at which two equal-length strings differ


There is also a levenshtein package which provides Levenshtein distance calculations. Finally, difflib can provide some simple string comparisons.
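For instance, difflib's SequenceMatcher gives a similarity ratio rather than an edit distance, which can be enough for quick comparisons:

```python
import difflib

# ratio() returns a float in [0, 1]; 1.0 means the sequences are identical
sm = difflib.SequenceMatcher(None, "hamming", "hamning")
print(round(sm.ratio(), 2))  # -> 0.86
```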

There is more information and example code on this old question.


Your existing code is slow because you recalculate the file hash in the innermost loop, which means every file gets hashed many times. If you calculate the hashes first, the process will be much more efficient:

files = ...
# Hash each file exactly once, up front.
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
# Each entry is (filename, hash): compare the hashes, skip self-pairs by filename.
file_comparisons = [
    (hamming(first[1], second[1]), first[0], second[0])
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]
]
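Since the question also asks for matrix-style retrieval, the precomputed pairs can be loaded into a nested dictionary keyed by filename. This is a minimal sketch: the `hamming` function here is a stand-in that counts differing bits between two hex-string hashes (like `50358c591cef4d76`), since pHash's own distance function may not be available everywhere:

```python
def hamming(h1, h2):
    # Bitwise Hamming distance between two equal-length hex strings
    # (stand-in for a pHash distance function).
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def distance_matrix(items_and_hashes):
    # Build a lookup so matrix[a][b] gives the distance between items a and b.
    matrix = {}
    for name_a, hash_a in items_and_hashes:
        row = {}
        for name_b, hash_b in items_and_hashes:
            if name_a != name_b:
                row[name_b] = hamming(hash_a, hash_b)
        matrix[name_a] = row
    return matrix

pairs = [("a.png", "50358c591cef4d76"), ("b.png", "50358c591cef4d77")]
m = distance_matrix(pairs)
print(m["a.png"]["b.png"])  # -> 1
```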


This process fundamentally involves O(N^2) comparisons, so distributing it in a way that suits a map-reduce problem means taking the complete set of strings and dividing them into B blocks, where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:

all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
# Each worker compares every string in its first block
# against every string in its second block.
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
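The dispatch above can be generalized with a hypothetical helper that splits the strings into B blocks and yields one (block, block) task per worker; how each task is handed to a Hadoop Streaming mapper is left abstract:

```python
def block_tasks(strings, num_workers):
    # Split strings into B blocks, where B * B == num_workers,
    # and yield one (left_block, right_block) comparison task per worker.
    b = int(num_workers ** 0.5)
    assert b * b == num_workers, "number of workers must be a perfect square"
    size = -(-len(strings) // b)  # ceiling division so every string lands in a block
    blocks = [strings[i:i + size] for i in range(0, len(strings), size)]
    for left in blocks:
        for right in blocks:
            yield (left, right)

tasks = list(block_tasks(list("ABCDEFGHIJKLMNOP"), 4))
print(len(tasks))  # -> 4 tasks, one per worker
```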

