Fuzzy string comparison


Question

I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a MySQL database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.

Thanks,
Steve Bergman

Recommended answer

Steve Bergman wrote:

I'm looking for a module to do fuzzy comparison of strings. [...]

Check module difflib, it returns difference between two sequences.
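
For what it's worth, a minimal sketch of how difflib could be used to get a single similarity number, assuming the standard-library difflib.SequenceMatcher class is the part of the module meant here. The helper name similarity() and the sample item numbers are made up for illustration.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0.0, 1.0]; 1.0 means the two strings are identical."""
    # The first argument is the "junk" predicate; None disables junk filtering.
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical item numbers, not taken from the poster's data.
print(similarity("2410", "2410"))  # 1.0
print(similarity("2410", "241O"))  # 0.75 (three of four characters match)
```

Note that this is a similarity (higher is closer), not a distance.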


Wojciech Mula wrote:

Steve Bergman wrote:

I'm looking for a module to do fuzzy comparison of strings. [...]

Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.

What I would do for a quick start on this exercise (as described) is:

To compare two strings, take copies, and:
1. strip out all spaces (including \xA0, i.e. &nbsp;, if the data has
been anywhere near the web) from each string; also all "-" (in general
strip frequently occurring meaningless punctuation)
2. remove leading zeroes from each string
3. d = levenshtein_distance(string_a, string_b) # string_a etc is the
reduced string, not the original
4. error_metric = float(d) / max(len(string_a), len(string_b))

The error_metric will be 0.0 if the strings are the same (after
removing spaces, leading zeroes, etc) and 1.0 if they are completely
different (no characters in common).
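
The four steps above can be sketched roughly as follows. This is only an illustration: levenshtein_distance() here is a plain textbook implementation written out so the example is self-contained (a third-party package would normally be faster), reduce_key() is a hypothetical helper name, and returning 0.0 when both reduced strings are empty is an assumption that the steps above do not cover.

```python
def levenshtein_distance(a, b):
    """Textbook edit distance: unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def reduce_key(s):
    """Steps 1 and 2: drop spaces, non-breaking spaces and '-', then leading zeroes."""
    for ch in (" ", "\xa0", "-"):
        s = s.replace(ch, "")
    return s.lstrip("0")


def error_metric(raw_a, raw_b):
    """Steps 3 and 4: distance between the reduced strings, scaled to [0.0, 1.0]."""
    string_a, string_b = reduce_key(raw_a), reduce_key(raw_b)
    longest = max(len(string_a), len(string_b))
    if longest == 0:
        return 0.0  # assumption: two empty reduced strings count as identical
    return levenshtein_distance(string_a, string_b) / longest


# Hypothetical item numbers for illustration.
print(error_metric("00AB-123", "AB 123"))  # 0.0
print(error_metric("2410", "241O"))        # 0.25
```

In Python 3 the `/` operator already does float division, so the float(d) from step 4 is not needed.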

.... and you don't want anything "kind of like soundex". That's a bit
like saying you'd like to travel in an aeroplane "kind of like the
Wright brothers'" ;-)

Cheers,
John


On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:

Wojciech Mula wrote:

Steve Bergman wrote:

I'm looking for a module to do fuzzy comparison of strings. [...]

Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
[...]

Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).
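
A rough sketch of that idea, written from scratch rather than by patching distance.py (so the variable names differ from the "change" line quoted above). The look-alike table and its costs are invented for illustration; the OP would have to supply the pairs and weights that make sense for the actual data.

```python
# Hypothetical look-alike pairs and substitution costs; tune for real data.
LOOKALIKE_COST = {
    frozenset("0O"): 0.1,
    frozenset("1I"): 0.1,
    frozenset("1l"): 0.1,
    frozenset("5S"): 0.2,
}


def visual_distance(x, y):
    """Return 0.0 for identical characters, a small cost for look-alikes, else 1.0."""
    if x == y:
        return 0.0
    return LOOKALIKE_COST.get(frozenset((x, y)), 1.0)


def visual_levenshtein(a, b):
    """Levenshtein distance whose substitution cost comes from visual_distance()."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,       # delete ca
                               current[j - 1] + 1.0,    # insert cb
                               previous[j - 1] + visual_distance(ca, cb)))
        previous = current
    return previous[-1]


print(visual_levenshtein("2410", "241O"))  # letter O for zero: small distance
print(visual_levenshtein("2410", "241X"))  # unrelated character: distance 1.0
```

With this weighting 241O comes out much closer to 2410 than 241X does, whereas the plain Levenshtein distance is 1 in both cases.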

Hope this helps,

-Carsten

