模糊查找 [英] Fuzzy Lookups

查看:72
本文介绍了模糊查找的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些CD,并已在PC上存档。我写了一个Python

脚本,该脚本跨越存档并返回其内容列表:

[[genre,artist,album,song] ...]。我想添加搜索功能

找到特定歌曲的所有版本。这比你想象的更难。例如,Cajun国歌。是Jolie Blond,

但它可以用几种不同的方式拼写jolie,joli,blon,blond,

等...另外提供歌曲信息的各种在线服务

充满了拼写错误,所以一个普通的字符串匹配就不会在那里得到
。我们需要的是一个模糊的字符串匹配,结果是

有一个非常好的,Levenshtein距离,这是插入,删除和替换的数字
需要将一个字符串转换为

另一个字符串。在我的应用程序中,我将所需的歌曲标题与我列表中的所有

歌曲标题相匹配,并返回Levenshtein

距离最低的歌曲。这是非常值得一提的,人们可能会惊人地说,

有效,轻松找到列表中所有版本的Jolie Blon。


我正在使用以下代码片段(在网上找到,正确的归因

不确定),它计算Levenshtein的距离。


def distance(a,b):

c = {}

n = len(a); m = len(b)


我在范围内(0,n + 1):

c [i,0] = i

表示范围内的j(0,m + 1):

c [0,j] = j


for i in range(1,n +1):

范围内的j(1,m + 1):

x = c [i-1,j] +1

y = c [i,j-1] +1

如果[i-1] == b [j-1]:

z = c [i-1 ,j-1]

else:

z = c [i-1,j-1] +1

c [i,j] = min(x,y,z)

返回c [n,m]


如上所述,这很有效,我很满意,但是我想知道是否有更多Pythonic方式进行此类查找?


jab

解决方案

>如上所述,这很有效,我很满意它,但我

想知道是否有更多Pythonic方式进行这种类型的查找?



我自己做了一个levenshtein-fuzzy-search,但是我通过

增强了我的版本以下列方式对距离进行了规范化:


def relative (a,b):

"""

计算两个字符串之间的相对距离。它在范围内

(0-1),其中1表示完全相等。

@type a:string

@param a:arg one

@type b:string

@param b:arg two

@rtype:float

@return:距离

"""

d =距离(a,b)

更长=浮动(最大值((len(a) ,len(b))))

short = float(min((len(a),len(b))))

r =((更长 - d) /更长)*(更短/更长)

返回r

距离是与你一样的磨坊levenshtein距离。


当你试图比较


" Angelina Jolie"




" AngelinaJolei"




" Bob"


两者的l-dist均为3,但标准化距离为

0.729591836735




0.015306122449


- 让你选择b etter match :)


问候,


Diez


Diez B. Roggisch写道:

当你尝试例如,优势变得明显比较

Angelina Jolie

与<< AngelinaJolei'



Bob

两者的l-dist为3


距离(Angelina Jolie,AngelinaJolei)
3距离(Angelina Jolie,Bob)



13


我错过了什么?


< / F>


Fredrik Lundh写道:

Diez B. Roggisch写道:

当你尝试时,优势变得明显例如比较

Angelina Jolie

与<< AngelinaJolei'



Bob

两者的l-dist为3


distance(" Angelina Jolie",AngelinaJolei)3段距离(Angelina Jolie,Bob)


13

我错过了什么?




嗯。我错过了一些东西 - 1在3之前13,当我在运行示例后查看我的

终端时。并根据

http://www.reference .com / browse / wiki ... htein_distance


它有财产


"""它是总是至少两个字符串大小的差异。""


我从那里得到的实现(或者更好的来自Magnus Lie Hetland

whoms python版本在那里被引用)


所以你是对的,我的例子是废话。


但我遇到了我的正常化是有道理的 - 否则我不会这样做b
$

我想它更符合(咳嗽的例子)


" abcdef"





" abcefd"
相比

abcd


我只能说我用它来模糊地比较人们和酒店名称,并且

应用标准化使我的结果好得多。


很抱歉引起混淆。


Diez


I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn''t get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

解决方案

> As mentioned above this works quite well and I am happy with it, but I

wonder if there is a more Pythonic way of doing this type of lookup?



I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = distance(a,b)
longer = float(max((len(a), len(b))))
shorter = float(min((len(a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r
distance is a run-of-the mill levenshtein-distance as yours.

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3, but the normalized distance is
0.729591836735

and

0.015306122449

respectively - making you chose the better match :)

regards,

Diez


Diez B. Roggisch wrote:

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3


distance("Angelina Jolie", "AngelinaJolei") 3 distance("Angelina Jolie", "Bob")


13

what did I miss ?

</F>


Fredrik Lundh wrote:

Diez B. Roggisch wrote:

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3


distance("Angelina Jolie", "AngelinaJolei") 3 distance("Angelina Jolie", "Bob")


13

what did I miss ?



Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki...htein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And my implementation I got from there (or better from Magnus Lie Hetland
whoms python version is referenced there)

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn''t
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people''s and hotel names, and
applying the normalization made my results by far better.

Sorry to cause confusion.

Diez


这篇关于模糊查找的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆