When to use which fuzz function to compare 2 strings

Question

I am learning fuzzywuzzy in Python.

I understand the concept of fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio. My question is when to use which function?

  • Do I check the lengths of the 2 strings first and, if they are not similar, rule out fuzz.partial_ratio?
  • If the lengths of the 2 strings are similar, do I use fuzz.token_sort_ratio?
  • Should I always use fuzz.token_set_ratio?

Does anyone know what criteria SeatGeek uses?

I am trying to build a real estate website and am thinking of using fuzzywuzzy to compare addresses.

Recommended answer

Great question.

I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.

Under the hood, each of the four methods calculates the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio function (the ratio method of difflib.SequenceMatcher), which will:

Return a measure of the sequences' similarity (float in [0,1]).

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.
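To make the 2.0*M / T formula concrete, here is a small sketch that recomputes the score by hand from difflib's matching blocks and compares it with SequenceMatcher.ratio(). It uses only the standard library; the variable names are mine, and scaling to an integer out of 100 is my assumption about how the fuzz scores shown in this answer are derived from the float.

```python
from difflib import SequenceMatcher

a, b = "NEW YORK METS", "NEW YORK MEATS"
matcher = SequenceMatcher(None, a, b)

# M: number of matched characters, summed over all matching blocks
# (the final sentinel block has size 0, so it does not affect the sum)
matches = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of characters in both strings
total = len(a) + len(b)

manual = 2.0 * matches / total
print(matches, total)        # 13 27
print(round(manual * 100))   # 96
```

Here "NEW YORK ME" and "TS" are the matching blocks, so M is 13 and T is 27.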

The four fuzzywuzzy methods call difflib.ratio on different combinations of the input strings.

fuzz.ratio

Simple. Just calls difflib.ratio on the two input strings.

from fuzzywuzzy import fuzz

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS")
> 96

fuzz.partial_ratio

Attempts to account for partial string matches better. Calls ratio using the shorter string (length n) against all n-length substrings of the longer string and returns the highest score.

Notice here that "YANKEES" is the shorter string (length 7), and we run the ratio with "YANKEES" against all substrings of length 7 of "NEW YORK YANKEES" (which would include checking against "YANKEES", a 100% match):

fuzz.ratio("YANKEES", "NEW YORK YANKEES")
> 60
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")
> 100
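The sliding-window idea described above can be sketched with difflib alone. This is a literal reading of the description, not the library's actual code, which may pick candidate windows differently; partial_ratio_sketch and ratio are names I made up, and returning 0 for an empty input is an arbitrary choice for the sketch.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio_sketch(a: str, b: str) -> int:
    """Best ratio of the shorter string against every same-length
    window of the longer string (hypothetical helper)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    n = len(shorter)
    if n == 0:
        return 0  # arbitrary choice for empty input in this sketch
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(ratio(shorter, window) for window in windows)

print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))  # 100
```

The last window of "NEW YORK YANKEES" is exactly "YANKEES", which is why the maximum is 100.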

fuzz.token_sort_ratio

Attempts to account for similar strings out of order. Calls ratio on both strings after sorting the tokens in each string. Notice here that fuzz.ratio and fuzz.partial_ratio both fail, but once you sort the tokens it's a 100% match:

fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 100
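A rough sketch of the token-sorting step, again using only difflib. The helper names are hypothetical, and the normalization here (lowercase, split on whitespace) is a simplifying assumption; the library's own preprocessing may differ, for example in how it treats punctuation.

```python
from difflib import SequenceMatcher

def sort_tokens(s: str) -> str:
    """Lowercase, split on whitespace, and rejoin the tokens in sorted order."""
    return " ".join(sorted(s.lower().split()))

def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Ratio of the two strings after sorting their tokens (hypothetical helper)."""
    matcher = SequenceMatcher(None, sort_tokens(a), sort_tokens(b))
    return round(100 * matcher.ratio())

a = "New York Mets vs Atlanta Braves"
b = "Atlanta Braves vs New York Mets"
print(sort_tokens(a))                 # atlanta braves mets new vs york
print(token_sort_ratio_sketch(a, b))  # 100
```

Both inputs sort to the same token sequence, so the comparison is between identical strings.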

fuzz.token_set_ratio

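Attempts to rule out differences in the strings. Calls ratio on three particular substring sets and returns the max:

  1. intersection-only and the intersection with the remainder of string one
  2. intersection-only and the intersection with the remainder of string two
  3. intersection with the remainder of one and intersection with the remainder of two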

Notice that by splitting up the intersection and remainders of the two strings, we're accounting for both how similar and different the two strings are:

fuzz.ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 36
fuzz.partial_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 61
fuzz.token_sort_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 51
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 91
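The three comparisons listed above can be sketched as follows with difflib. This follows the description in this answer rather than the library source, the function names are mine, and tokenization is simplified to lowercase plus whitespace splitting.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_ratio_sketch(a: str, b: str) -> int:
    """Max ratio over the three intersection/remainder pairings (hypothetical helper)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    # Sorted intersection, then intersection plus each string's leftover tokens
    common = " ".join(sorted(tokens_a & tokens_b))
    with_rest_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    with_rest_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        ratio(common, with_rest_a),       # 1. intersection vs intersection + remainder of one
        ratio(common, with_rest_b),       # 2. intersection vs intersection + remainder of two
        ratio(with_rest_a, with_rest_b),  # 3. the two combined strings against each other
    )

print(token_set_ratio_sketch(
    "mariners vs angels",
    "los angeles angels of anaheim at seattle mariners",
))  # 91
```

In this example the winning pair is "angels mariners" against "angels mariners vs": 15 matched characters out of 33 total gives 2 * 15 / 33, which rounds to 91.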

Application

This is where the magic happens. At SeatGeek, essentially we create a vector score with each ratio for each data point (venue, event name, etc.) and use that to inform programmatic decisions of similarity that are specific to our problem domain.
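As a loose illustration of a per-field score vector, here is a sketch that scores each field of two records with two difflib-based ratios and then applies a simple rule. Everything in it is an assumption for illustration: the sample records, the function names, the choice of two scores instead of four, and the threshold of 90 are made up and are not SeatGeek's actual logic.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a: str, b: str) -> int:
    """Ratio after lowercasing and sorting whitespace-separated tokens."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(normalize(a), normalize(b))

def score_vector(rec_a: dict, rec_b: dict, fields: list) -> dict:
    """One small vector of scores per field (hypothetical structure)."""
    return {
        f: {"ratio": ratio(rec_a[f], rec_b[f]),
            "token_sort": token_sort_ratio(rec_a[f], rec_b[f])}
        for f in fields
    }

def looks_like_same_event(scores: dict, threshold: int = 90) -> bool:
    """Made-up rule: every field must clear the threshold on token_sort."""
    return all(s["token_sort"] >= threshold for s in scores.values())

# Made-up sample records
rec_a = {"event": "New York Mets vs Atlanta Braves", "venue": "Citi Field"}
rec_b = {"event": "Atlanta Braves vs New York Mets", "venue": "Citi Field"}

scores = score_vector(rec_a, rec_b, ["event", "venue"])
print(looks_like_same_event(scores))  # True
```

In practice the rule would be tuned per field, since a venue name and an event name tolerate different kinds of variation.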

That being said, truth be told, it doesn't sound like FuzzyWuzzy is useful for your use case. It will be tremendously bad at determining if two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12":

fuzz.ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 93
fuzz.partial_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 85
fuzz.token_sort_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 95
fuzz.token_set_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 100

FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.

For your problem you would be better off using the Google Geocoding API.
