Strategies for finding duplicate mailing addresses

Question

I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'

addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'

I'm planning on applying some string transformations to abbreviate long words, like NORTH -> N, and to remove all spaces, commas, dashes and pound symbols. Now, having this output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?

addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'

addr_3 = '570348THAV'
addr_4 = '570348AV'

Thanks,

Eduardo

Recommended answer

First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

adr = " ".join(adr.lower().split())

Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

import re
adr = re.sub(r"1st\b", "1", adr)  # raw string: \b must stay a word boundary
adr = re.sub(r"([2-9])\s?nd\b", r"\1", adr)

Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set the first one to do that; because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (that second one is "41 Street" abbreviated).
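The two substitutions above only cover "1st" and "2nd"-style suffixes. As a rough sketch (not part of the original answer), a single pattern could strip "st", "nd", "rd" and "th" whenever the suffix is attached directly to the digits, which leaves the ambiguous "41 St" case alone. The helper name `strip_ordinal_suffixes` is hypothetical.

```python
import re

# Hypothetical helper: matches digits immediately followed by an ordinal suffix.
# No whitespace is allowed between number and suffix, so "41 st" is not touched.
_ORDINAL = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b", re.IGNORECASE)

def strip_ordinal_suffixes(adr):
    """Return adr with ordinal suffixes removed from numbers, e.g. '48th' -> '48'."""
    return _ORDINAL.sub(r"\1", adr)

print(strip_ordinal_suffixes("5703 - 48th ave"))
print(strip_ordinal_suffixes("41 st ave"))
```

Requiring the suffix to touch the number is the conservative choice here: it handles "48TH" from the question while never guessing whether a separate "St" means "Street".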

Be sure to read all the help for the re module; it's powerful but cryptic.

Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

http://en.wikipedia.org/wiki/Soundex

http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]

Then you can work with the list or join it back to a string as you think best.
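The list comprehension above calls a `soundex` function that is not in the Python standard library; the answer relies on the implementation from the linked page. In case that page is unavailable, here is a minimal sketch of the common American Soundex rules as described in the Wikipedia article. It is an assumption about what the linked code does, not a copy of it.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code for word ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return result.ljust(4, "0")

adr = "3 fairmont link s"
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
print(adrlist)  # ['3', 'F655', 'L520', 'S000']
```

With this sketch, "fairmont" and the misspelling "fairmount" both map to F655, which is the kind of tolerance the Soundex step is meant to provide.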

The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.
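The question also asks how to score similarity between the normalised strings, which the steps above do not cover. One possible approach, sketched here with the standard library's `difflib.SequenceMatcher`, is to compare every pair and keep those above a cutoff. The function names and the 0.85 threshold are assumptions chosen for illustration; a safe threshold would need to be tuned on real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def likely_duplicates(addresses, threshold=0.85):
    """Return (first, second, score) for each pair scoring at or above threshold.

    The default threshold is an arbitrary illustrative value, not a recommendation.
    """
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            score = similarity(addresses[i], addresses[j])
            if score >= threshold:
                pairs.append((addresses[i], addresses[j], score))
    return pairs

# The normalised strings from the question.
normalised = ["3FAIRMONTLINKS", "3FAIRMONTLINKS", "570348THAV", "570348AV"]
for first, second, score in likely_duplicates(normalised):
    print(first, second, round(score, 2))
```

On the question's data this pairs the two FAIRMONT strings (ratio 1.0) and the two 5703 strings (ratio about 0.89). Comparing every pair is quadratic in the number of addresses, so for a large list it may help to group addresses first, for example by postal code or house number.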

Good luck.
