Fuzzy matching of postal addresses


Question

I have a problem that I suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.

Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.

I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases
could be a pain to deal with!

It's probably not relevant, but I'm using ZODB to store the databases.

The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically fewer than 50). The database has a mapping that lets me do
this efficiently. Then I look for 'good' matches. If there is exactly
one I declare a success. This isn't as efficient as it could be; it's
O(n^2) for each postcode, because I end up comparing all possible pairs.
But it's fast enough for my application.

The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation,
but there are other issues.
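
As a rough illustration of the approach described above (this is not Andrew's actual code), the sketch below groups database B by postcode, normalises case and punctuation, and accepts a match only when exactly one candidate agrees. Plain lists of `(postcode, address)` tuples stand in for the ZODB databases, and the names `normalise`, `index_by_postcode` and `match_all` are made up for the example.

```python
import string
from collections import defaultdict

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalise(address):
    """Upper-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(address.upper().translate(_PUNCT_TO_SPACE).split())

def index_by_postcode(records):
    """Build postcode -> [address, ...] from (postcode, address) pairs."""
    index = defaultdict(list)
    for postcode, address in records:
        index[postcode].append(address)
    return index

def match_all(db_a, db_b):
    """Map each address in db_a to its single match in db_b, or None."""
    by_postcode = index_by_postcode(db_b)
    results = {}
    for postcode, address in db_a:
        wanted = normalise(address)
        # Only addresses sharing the postcode are compared (the "blocking" step).
        hits = [b for b in by_postcode.get(postcode, []) if normalise(b) == wanted]
        # Declare success only when there is exactly one good match.
        results[address] = hits[0] if len(hits) == 1 else None
    return results
```

With this kind of exact comparison after light normalisation, pairs such as the ones listed below stay unmatched, which is the gap the rest of the thread tries to close.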

Here are just some examples where the software didn't declare a match:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

Penthouse, Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

The challenge is to fix some of the false negatives above without
introducing false positives!

Any pointers gratefully received.

--
Andrew McLean

Recommended answer

On 2005-01-18, Andrew McLean <sp***********@at-andros.demon.co.uk> wrote:
> I have a problem that I suspect isn't unusual and I'm looking to see if
> there is any code available to help. I've Googled without success.

I have done something very similar (well, near as dammit identical,
actually); unfortunately, I did this in Perl[1].

I can give you some general pointers on how to go about this
type of thing, but can't actually provide any code, as it is at work.

> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

In my implementation this was the exact same setup: database A
(OS/RM Addresspoint data) contained a metric shitload of addresses,
with addresses to be matched against them coming from client-supplied
data.

> I'm 90% of the way there, in the sense that I have a simplistic approach
> that matches 90% of the addresses in database A. But the extra cases
> could be a pain to deal with!

There is no way to guarantee 100% accuracy if one (or both) of the
datasets is collated from a non-authoritative source, i.e. the actual
homeowners. ;)

> It's probably not relevant, but I'm using ZODB to store the databases.
> The current approach is to loop over addresses in database A. I then
> identify all addresses in database B that share the same postal code
> (typically less than 50). The database has a mapping that lets me do
> this efficiently. Then I look for 'good' matches. If there is exactly
> one I declare a success. This isn't as efficient as it could be, it's
> O(n^2) for each postcode, because I end up comparing all possible pairs.
> But it's fast enough for my application.

OK, this is a good start. The first thing to do is to clean the data,
especially the postcodes.

Something along the lines of the following (a sketch; I can't remember
the exact Python implementation I did):

    def clean_postcode(postcode):
        """Keep alphanumerics, plus a space if it is the fourth character."""
        cleaned = []
        for i, ch in enumerate(postcode):
            # if the fourth character is a digit or a space, keep it
            if i == 3 and (ch.isdigit() or ch.isspace()):
                cleaned.append(ch)
            # otherwise keep the character only if it is alphanumeric
            elif ch.isalnum():
                cleaned.append(ch)
        return "".join(cleaned)

    postcode = clean_postcode(resultarray["postcode"])

That puts all the postcodes into the format that the Royal Mail uses.
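
The position-based cleaning above leaves inputs such as "dt83ss" without a space. A possible alternative, which is not what James describes, relies on the fact that the inward part of a UK postcode is always the last three characters: strip everything that is not a letter or digit, upper-case, and re-insert a single space before the final three characters. The function name `normalise_postcode` is hypothetical, and the sketch does no validation beyond a minimum length.

```python
import re

def normalise_postcode(raw):
    """Return a postcode as 'OUTWARD INWARD', e.g. 'dt83ss' -> 'DT8 3SS'."""
    # Drop spaces, hyphens and any other non-alphanumeric characters.
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    # Too short to be a full postcode: return it unchanged rather than guess.
    if len(compact) < 5:
        return compact
    # The inward code is always the last three characters.
    return compact[:-3] + " " + compact[-3:]
```

Normalising both databases this way before building the postcode index means that records differing only in postcode spacing or case land in the same bucket.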

Then search on postcode ;)

The next thing I did was to split each individual word out from
its field, and match them in a specific order (I found that starting
with elements such as house name, number, and street name
was the best approach).

If a word matched, it was assigned a score (1 for an exact match, 0.7 for
a metaphone match and 0.6 for a soundex match, IIRC), and when the searching
was finished I took the resulting score and divided it by the number of
word elements.
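
A minimal sketch of that word-scoring idea follows; it is not James's Perl script. It assumes a simplified American Soundex written out by hand (the Python standard library has none), leaves out the metaphone tier entirely, and ignores the field-by-field ordering, so only the 1.0 and 0.6 weights from the description are used. All function names are invented for the example.

```python
# Letter -> Soundex digit; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit

def soundex(word):
    """Four-character Soundex code, or '' if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]

def word_score(word, candidate_words):
    """1.0 for an exact match, 0.6 for a Soundex match, otherwise 0.0."""
    word = word.upper()
    candidates = [c.upper() for c in candidate_words]
    if word in candidates:
        return 1.0
    code = soundex(word)
    if code and any(soundex(c) == code for c in candidates):
        return 0.6
    return 0.0

def address_score(words, candidate_words):
    """Sum of the per-word scores divided by the number of word elements."""
    if not words:
        return 0.0
    return sum(word_score(w, candidate_words) for w in words) / len(words)
```

Numbers such as house numbers have an empty Soundex code, so they can only score through an exact match, which keeps "1" from being treated as similar to "2".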

If that score was higher than any of the previous scores then it was put
in a given variable. If there were a number of equally good (bad?) matches
then they were appended onto an array, and if there was no clear winner
by the time that the last of the records for that postcode had been processed,
it spat out a multiple-choice list.

The trick is picking a threshold level below which no matches are
put into the DB, even if they are the best scoring. (I used a threshold
of 0.3.)
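
The selection step might look roughly like the sketch below. It assumes the candidates for one postcode have already been scored, and it returns a status plus the winning candidate(s). The function name, the status strings and the choice to treat a score exactly equal to the threshold as acceptable are all assumptions; only the 0.3 default comes from the post.

```python
def choose_match(scored, threshold=0.3):
    """Pick the best candidate from a list of (candidate, score) pairs.

    Returns ("match", [winner]), ("ambiguous", [tied winners]) or
    ("no_match", []).
    """
    # Discard anything below the threshold, even if it is the best on offer.
    viable = [(cand, score) for cand, score in scored if score >= threshold]
    if not viable:
        return ("no_match", [])
    best = max(score for _, score in viable)
    winners = [cand for cand, score in viable if score == best]
    if len(winners) == 1:
        return ("match", winners)
    # Equally good candidates: hand them to a person as a multiple-choice list.
    return ("ambiguous", winners)
```

Keeping the "ambiguous" case separate from "no_match" mirrors the multiple-choice list mentioned above: those records go to manual review instead of being silently dropped or guessed.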

This can be refined. The current, extremely baroque, Perl script that
does this drops certain values from the data array if
there is an exact match on certain fields, such as house name.
It doesn't reduce the value of the integer that the result is divided
by, though, thus favouring results that return an exact match on a couple
of given fields.

> The challenge is to fix some of the false negatives above without
> introducing false positives!
>
> Any pointers gratefully received.




Hope this is a) understandable, and b) useful ;)

FWIW, the Perl script (and I would expect a similarly implemented Python
script to perform about as well), running a somewhat flaky set of
user-collated data against the Royal Mail Addresspoint data, managed
a 75% hit rate, with an additional 5% requiring user intervention,
and as near as I have been able to ascertain a >1% false positive
count, from a dataset of nearly 17,000 addresses.

With cleaner and more up-to-date data I would expect the results
to be noticeably better.

[1] It is still my main language; I don't use Python enough to
think in it as easily as I think in Perl ;)

--
James jamesk[at]homeric[dot]co[dot]uk

"Consistency is the last resort of the unimaginative."
-- Bob Hope


Andrew McLean wrote:
> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation,
> but there are other issues.

I'd do a bit more extensive normalization. First, strip off everything from
the city through the postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your
examples). In the remaining string, remove any punctuation and words
like "the", "flat", etc.
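
One way this transformation could be sketched is shown below. Jeff does not say how to find where the city starts, so the sketch assumes the post town for the postcode is already known and passes it in as `town`. The stop-word list holds just the two words named above, and the upper-casing and the name `munge` are likewise choices made for the example.

```python
import re

# Words dropped during munging; only the two named in the post are listed.
STOP_WORDS = {"THE", "FLAT"}

def munge(address, town):
    """Strip town..postcode, punctuation and stop words; return upper case."""
    upper = address.upper()
    # Cut the string at the first whole-word occurrence of the town name.
    found = re.search(r"\b" + re.escape(town.upper()) + r"\b", upper)
    if found:
        upper = upper[:found.start()]
    # Drop apostrophes outright so that "John's" becomes "JOHNS".
    upper = upper.replace("'", "")
    # Every other run of non-alphanumerics becomes a word separator.
    words = re.sub(r"[^A-Z0-9]+", " ", upper).split()
    return " ".join(w for w in words if w not in STOP_WORDS)
```

For example, `munge("THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS", "Beaminster")` gives `"BEECHES 1 BRANTWOOD"`. A town name that also appears earlier in the address (as part of a street name, say) would cut the string too early, so real data may need a more careful rule.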
> Here are just some examples where the software didn't declare a match:

And how they'd look after the transformation I suggest above:

> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

1 Brantwood
BEECHES 1 BRANTWOOD

> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

2 Bethany House Broadwindsor Road
2 BETHANY HOUSE

> Penthouse, Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE

> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR

> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF




Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL
Obviously, this is not perfect, but it's closer. At this point, you
could perhaps say that if either string is a substring of the other,
you have a match. That should work with all of these examples except
the last one. You could either do this munging for all address
lookups, or you could do it only for those that don't find a match in
the simplistic way. Either way, you can store Database B's
pre-munged addresses so that you don't need to constantly recompute
them. I can't say for certain how this will perform in the false
positives department, but I'd expect that it wouldn't be too bad.
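
The substring test can be sketched as below, assuming both strings have already been munged to upper case with single spaces between words. One deliberate departure from a plain `in` check: both strings are padded with spaces so that only whole words line up, otherwise "1 BRANTWOOD" would count as a substring of "11 BRANTWOOD". The function name is hypothetical.

```python
def is_word_substring(a, b):
    """True if either munged address occurs in the other on word boundaries."""
    # Padding with spaces forces matches to start and end on whole words.
    padded_a = f" {a} "
    padded_b = f" {b} "
    return padded_a in padded_b or padded_b in padded_a
```

On the munged examples above this accepts the first four pairs and rejects the Pinnacles pair, which matches the expectation stated in the paragraph.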

For a more-detailed matching, you might look into finding an algorithm
to determine the "distance" between two strings and using that to
score possible matches.
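
As one possible stand-in for such a distance measure, the standard library's `difflib.SequenceMatcher` gives a similarity ratio between 0 and 1. It is not an edit distance in the strict sense, and Jeff does not name any particular algorithm, so treat the choice and the helper names below as assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def best_candidate(target, candidates):
    """Return the candidate string most similar to target."""
    return max(candidates, key=lambda cand: similarity(target, cand))
```

Applied to munged strings, "PINNACLES WHITE SHEET HILL" and "PINNACLES WHITESHEET HILL" differ by a single space and so score close to 1, while unrelated addresses in the same postcode score much lower. Where to put the acceptance cut-off is something to tune against known matches.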

Jeff Shannon
Technician/Programmer
Credit International

"Andrew McLean" <sp***********@at-andros.demon.co.uk> wrote in message
news:96**************@at-andros.demon.co.uk...
> I have a problem that I suspect isn't unusual and I'm looking to see if
> there is any code available to help. I've Googled without success.

There isn't any publicly available code that I'm aware of.
Companies that do a good job of address matching regard
that code as a competitive advantage on a par with the
crown jewels.

> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More precisely,
> for each address in database A I want to find a single matching address in
> database B.
>
> I'm 90% of the way there, in the sense that I have a simplistic approach
> that matches 90% of the addresses in database A. But the extra cases could
> be a pain to deal with!

From a purely pragmatic viewpoint, is this a one-off, and how many
non-matches do you have to deal with? If the answers are "yes"
and "not all that many", I'd do the rest by hand.

> It's probably not relevant, but I'm using ZODB to store the databases.

I doubt it's relevant.

> The current approach is to loop over addresses in database A. I then
> identify all addresses in database B that share the same postal code
> (typically less than 50). The database has a mapping that lets me do this
> efficiently. Then I look for 'good' matches. If there is exactly one I
> declare a success. This isn't as efficient as it could be, it's O(n^2) for
> each postcode, because I end up comparing all possible pairs. But it's
> fast enough for my application.
>
> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation, but
> there are other issues.

I used to work on a system that had a reasonably decent address
matching routine. The critical issue is, as you suspected, normalization.
You're not going far enough. You've also got an issue here that doesn't
exist in the States: named buildings.

> Here are just some examples where the software didn't declare a match:
>
> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

The first line is a street address; the second is a named building and a
street without a house number. There's no way of matching this unless you know
that The Beeches doesn't have flat (or room, etc.) numbers and can move the
1 to being the street address. On the other hand, this seems to be a
consistent problem in your database: in the US, the street address must
be associated with the street name. No comma is allowed between the two.

> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

The first is a flat, house name and street name; the second is a number
and a house name. Assuming that UK postal standards don't allow
more than one named building in a postal code, this is easily matchable
if you do a good job of normalization.

> Penthouse, Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

The issue here is to use the words "flat" and "the" to split the flat
name and the house name. Then the house number is in the wrong
part: it should go with the street name. See the comment above.

> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

This one may not be resolvable, unless there is only one house name
with "presbytery" in it within the postal code. Notice that "the" should
probably be dropped when normalizing.

> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

Spelling correction needed.
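
For this particular pair the difference is only where the word break falls ("White Sheet" against "WHITESHEET"), so a much cheaper check than real spelling correction may be enough: compare the strings with all whitespace removed. This is only a sketch of that narrow case, applied to addresses that have already had "the" and the town-to-postcode tail stripped. It does nothing for genuine misspellings, and the helper names are invented.

```python
def squash(text):
    """Upper-case the text and remove all whitespace."""
    return "".join(text.upper().split())

def same_ignoring_spaces(a, b):
    """True if the two strings differ only in case and word breaks."""
    return squash(a) == squash(b)
```

Because removing spaces can also merge a house number into the next word, this is probably safest as a fallback for addresses that found no match by stricter rules.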
> The challenge is to fix some of the false negatives above without
> introducing false positives!
>
> Any pointers gratefully received.

If, on the other hand, this is a repeating problem that's simply going
to be an ongoing headache, I'd look into commercial address correction
software. Here in the US, there are a number of vendors that have
such software to correct addresses to the standards of the USPS.
They also have databases of all the legitimate addresses in each
postal code. They're adjuncts of mass mailers, and they exist
because the USPS gives a mass mailing discount based on the
number of "good" addresses you give them.

I don't know what the situation is in the UK, but I'd be surprised
if there wasn't some available address database, either commercial
or free, possibly as an adjunct of the postal service.

The latter, by the way, is probably the first place I'd look. The
postal service has a major interest in having addresses that they
can deliver to without a lot of hassle.

Another place is Google. The first two pages of results for "Address
Matching software" gave two UK references, and several
Australian references.

John Roth

> --
> Andrew McLean





