对Winkler的Python性能改进要求 [英] Python performance improvement request for winkler

查看:74
本文介绍了对Winkler的Python性能改进要求的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是python n00b,我想对如何改进算法以提高该方法的性能提出一些建议,以计算两个名称的Jaro-Winkler距离.

I'm a python n00b and I'd like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names.

def winklerCompareP(str1, str2):
"""Return approximate string comparator measure (between 0.0 and 1.0)

USAGE:
  score = winkler(str1, str2)

ARGUMENTS:
  str1  The first string
  str2  The second string

DESCRIPTION:
  As described in 'An Application of the Fellegi-Sunter Model of
  Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler
  and Yves Thibaudeau.

  Based on the 'jaro' string comparator, but modifies it according to whether
  the first few characters are the same or not.
"""

# Quick check if the strings are the same - - - - - - - - - - - - - - - - - -
#
jaro_winkler_marker_char = chr(1)
if (str1 == str2):
    return 1.0

len1 = len(str1)
len2 = len(str2)
halflen = max(len1,len2) / 2 - 1

ass1  = ''  # Characters assigned in str1
ass2  = '' # Characters assigned in str2
#ass1 = ''
#ass2 = ''
workstr1 = str1
workstr2 = str2

common1 = 0    # Number of common characters
common2 = 0

#print "'len1', str1[i], start, end, index, ass1, workstr2, common1"
# Analyse the first string    - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len1):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len2)
    index = workstr2.find(str1[i],start,end)
    #print 'len1', str1[i], start, end, index, ass1, workstr2, common1
    if (index > -1):    # Found common character
        common1 += 1
        #ass1 += str1[i]
        ass1 = ass1 + str1[i]
        workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:]
#print "str1 analyse result", ass1, common1

#print "str1 analyse result", ass1, common1
# Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len2):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len1)
    index = workstr1.find(str2[i],start,end)
    #print 'len2', str2[i], start, end, index, ass1, workstr1, common2
    if (index > -1):    # Found common character
        common2 += 1
        #ass2 += str2[i]
        ass2 = ass2 + str2[i]
        workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:]

if (common1 != common2):
    print('Winkler: Wrong common values for strings "%s" and "%s"' % \
                (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \
                ', common should be the same.')
    common1 = float(common1+common2) / 2.0    ##### This is just a fix #####

if (common1 == 0):
    return 0.0

# Compute number of transpositions    - - - - - - - - - - - - - - - - - - - - -
#
transposition = 0
for i in range(len(ass1)):
    if (ass1[i] != ass2[i]):
        transposition += 1
transposition = transposition / 2.0

# Now compute how many characters are common at beginning - - - - - - - - - -
#
minlen = min(len1,len2)
for same in range(minlen+1):
    if (str1[:same] != str2[:same]):
        break
same -= 1
if (same > 4):
    same = 4

common1 = float(common1)
w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1)

wn = w + same*0.1 * (1.0 - w)
return wn

示例输出

ZIMMERMANN  ARMIENTO    0.814583333
ZIMMERMANN  ZIMMERMANN  1
ZIMMERMANN  CANNONS         0.766666667
CANNONS AKKER           0.8
CANNONS ALDERSON    0.845833333
CANNONS ALLANBY         0.833333333

推荐答案

我更关注于优化以从Python中获得更多收益,而不是优化算法,因为我认为并没有太多算法上的改进这里.这是我想到的一些Python优化.

I focused more on optimizing to get more out of Python than on optimizing the algorithm because I don't think that there is much of an algorithmic improvement to be had here. Here are some Python optimizations that I came up with.

(1).由于您似乎正在使用Python 2.x,因此将所有range()更改为xrange(). range()会在迭代之前生成完整的数字列表,而xrange会根据需要生成数字.

(1). Since you appear to be using Python 2.x, change all range()'s to xrange()'s. range() generates the full list of numbers before iterating over them while xrange generates them as needed.

(2).对max和min进行以下替换:

(2). Make the following substitutions for max and min:

start = max(0,i-halflen)

start = i - halflen if i > halflen else 0

end = min(i+halflen+1,len2)

使用

end = i+halflen+1 if i+halflen+1 < len2 else len2

在第一个循环中

,在第二个循环中类似.在函数的开头还有一个更小的min()和一个max(),对它们也是如此.替换min()和max()确实有助于减少时间.这些功能很方便,但比我用它们替换的方法要贵.

in the first loop and similar ones for the second loop. There's also another min() farther down and a max() near the beginning of the function so do the same with those. Replacing the min()'s and max()'s really helped to reduce the time. These are convenient functions, but more costly than the method I've replaced them with.

(3).使用common1而不是len(ass1).您已经在common1中跟踪了ass1的长度,因此让我们使用它而不是调用昂贵的函数来再次查找它.

(3). Use common1 instead of len(ass1). You've kept track of the length of ass1 in common1 so let's use it rather than calling a costly function to find it again.

(4).替换以下代码:

(4). Replace the following code:

minlen = min(len1,len2)
for same in xrange(minlen+1):
    if (str1[:same] != str2[:same]):
        break
same -= 1

使用

for same in xrange(minlen):
    if str1[same] != str2[same]:
        break

其原因主要是str1 [:same]每次在循环中都会创建一个新字符串,并且您将检查已经检查过的部分.另外,如果我们不必这样做,则无需再检查'' != ''并随后减小same.

The reason for this is mainly that str1[:same] creates a new string every time through the loop and you will be checking parts that you've already checked. Also, there's no need to check if '' != '' and decrement same afterwards if we don't have to.

(5).使用 psyco (一种即时编译器).下载并安装后,只需添加

(5). Use psyco, a just-in-time compiler of sorts. Once you've downloaded it and installed it, just add the lines

import psyco
psyco.full()

使用文件顶部的

即可使用它.除非您进行了我提到的其他更改,否则请不要使用psyco.出于某种原因,当我在您的原始代码上运行它时,它实际上放慢了速度.

at the top of the file to use it. Don't use psyco unless you do the other changes that I've mentioned. For some reason, when I ran it on your original code it actually slowed it down.

使用timeit,我发现前4次更改使我的时间减少了大约20%.但是,当我将psyco与这些更改一起添加时,代码比原始代码快3到4倍.

Using timeit, I found that I was getting a decrease in time of about 20% or so with the first 4 changes. However, when I add psyco along with those changes, the code is about 3x to 4x faster than the original.

如果您想提高速度

相当多的剩余时间在字符串的find()方法中.我决定尝试用我自己的代替.对于第一个循环,我替换了

A fair amount of the remaining time is in the string's find() method. I decided to try replacing this with my own. For the first loop, I replaced

index = workstr2.find(str1[i],start,end)

使用

index = -1
for j in xrange(start,end):
    if workstr2[j] == str1[i]:
        index = j
        break

和第二个循环的类似形式.如果没有psyco,这会减慢代码的速度,但是对于psyco,它会大大加快代码的速度.经过最后的更改,代码比原始代码快8到9倍.

and a similar form for the second loop. Without psyco, this slows down the code, but with psyco, it speeds it up quite a lot. With this final change the code is about 8x to 9x faster than the original.

如果速度不够快

然后您可能应该转向制作C模块.

Then you should probably turn to making a C module.

祝你好运!

这篇关于对Winkler的Python性能改进要求的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆