使用Soundex,Jaro Winkler和Edit Distance(UTL_MATCH)匹配Oracle复制列值 [英] Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL_MATCH)

查看:101
本文介绍了使用Soundex,Jaro Winkler和Edit Distance(UTL_MATCH)匹配Oracle复制列值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找到一种可靠的方法来匹配数据库中重复的人员记录.数据存在一些严重的数据质量问题,我也正在尝试解决这些问题,但是在我不得不批准之前,我一直牢牢掌握着已有的数据.

I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have got.

我可以使用的表格列是:

The table columns available to me are:

SURNAME       VARCHAR2(43)
FORENAME      VARCHAR2(38)
BIRTH_DATE    DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE      VARCHAR2(15)

SOUNDEX函数对此用途相对有限,但是UTL_MATCH包似乎可以使用Jaro Winker算法提供更好的匹配级别.

The SOUNDEX function is relatively limited for this use but the UTL_MATCH package seems to offer a better level of matching using the Jaro Winker algorithm.

没有人重新发明轮子,没有人实施一种可靠的方法来匹配这种类型的数据吗?

Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?

要解决的数据质量问题:

Data Quality issues to contend with:

  1. 邮政编码虽然是必填项,但并非总是完整输入.
  2. 地址数据质量相对较差,没有以固定格式输入地址(例如,某些行1可能是"Flat 1",而有些行1可能是"Flat1,22 Acacia Ave").
  3. 地名"列可以包含缩写,全名或有时不止一个地名.

例如,我正在考虑:

连接所有地址字段,并将Jaro Winkler算法应用于完整地址,再结合对串联在一起的全名进行类似的测试.

Concatenating all address fields and applying the Jaro Winkler algorithm to the full address combined with a similar test of the full name concatenated together.

可以直接比较匹配的出生日期,但是由于数据量巨大,仅匹配此数据是不够的.

The birth date can be compared directly for a match but due to the large volume of data just matching upon this isn't enough.

Oracle 10g R2企业版.

Oracle 10g R2 Enterprise Edition.

欢迎任何有用的建议.

Any helpful suggestions welcome.

推荐答案

我正在尝试找到一种可靠的方法来匹配重复的人 数据库中的记录."

"I am trying to find a reliable method for matching duplicate person records within the database."

A,没有这样的事情.您最希望得到的是一个带有合理怀疑元素的系统.

Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.

SQL> select n1
       , n2
       , soundex(n1) as sdx_n1
       , soundex(n2) as sdx_n2
       , utl_match.edit_distance_similarity(n1, n2) as ed
       , utl_match.jaro_winkler_similarity(n1, n2) as jw   
from t94
order by n1, n2
/


  2    3    4    5    6    7    8    9  
N1                   N2                   SDX_ SDX_         ED         JW
-------------------- -------------------- ---- ---- ---------- ----------
MARK                 MARKIE               M620 M620         67         93
MARK                 MARKS                M620 M620         80         96
MARK                 MARKUS               M620 M622         67         93
MARKY                MARKIE               M620 M620         67         89
MARSK                MARKS                M620 M620         60         95
MARX                 AMRX                 M620 A562         50         91
MARX                 M4RX                 M620 M620         75         85
MARX                 MARKS                M620 M620         60         84
MARX                 MARSK                M620 M620         60         84
MARX                 MAX                  M620 M200         75         93
MARX                 MRX                  M620 M620         75         92

11 rows selected.

SQL> SQL> SQL> 

SOUNDEX的最大优点是它可以对字符串进行标记化.这意味着它为您提供了可以被索引的索引:当涉及大量数据时,这是非常有价值的.另一方面,它又旧又粗糙.周围有更新的算法,例如Metaphone和Double Metaphone.您应该可以通过Google找到它们的PL/SQL实现.

The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed: this is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implemenations of them via Google.

评分的优点是它们允许一定程度的模糊性;因此您可以找到所有行where name_score >= 90%.压倒性的缺点是分数是相对的,因此您无法为它们编制索引.这种比较会杀死您.

The advantage of scoring is that they allow for a degree of fuzziness; so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative and so you cannot index them. This sort of comparison kills you with large volumes.

这是什么意思:

  1. 您需要多种策略.没有任何一种算法可以解决您的问题.
  2. 数据清理很有用.比较MARX与MRX和M4RX的得分:从名称中删除数字可以提高命中率.
  3. 您无法即时为大量名称评分.如果可以,请使用标记和预计分.如果您的客户流失率不高,请使用缓存.如果可以承受的话,请使用分区.
  4. 使用Oracle文本(或类似文本)构建昵称和变体词库.
  5. Oracle 11g向Oracle Text引入了特定的名称搜索功能. 了解更多信息.
  6. 建立一个标准名称表以进行评分并将实际数据记录链接到该表.
  7. 使用其他数据值,尤其是可索引的数据值(例如出生日期)来预先过滤大量名称或增加对拟议匹配项的信心.
  8. 请注意,其他数据值也有其自身的问题:是在2011年1月31日出生的人,年龄是11个月还是80岁?
  9. 请记住,名称是棘手的,尤其是当您必须考虑已被罗马化的名称时:有超过四百种不同的拼写Moammar Khadaffi(用罗马字母拼写)的方式-甚至Google都无法同意哪种变体是最规范的.
  1. You need a mix of strategies. No single algorithm will solve your problem.
  2. Data cleansing is useful. Compare the scores for MARX vs MRX and M4RX: stripping numbers out of names improves the hit rate.
  3. You cannot score big volumes of names on the fly. Use tokenizing and pre-scoring if you can. Use caching if you don't have a lot of churn. Use partitioning if you can afford it.
  4. Use a Oracle Text (or similar) to build a thesaurus of nicknames and variants.
  5. Oracle 11g introduced specific name search functionality to Oracle Text. Find out more.
  6. Build a table of canonical names for scoring and link actual data records to that.
  7. Use other data values, especially indexable ones like date of birth, to pre-filter large volumes of names or to increase confidence in proposed matches.
  8. Be aware that other data values come with their own problems: is someone born on 31/01/11 eleven months old or eighty years old?
  9. Remember that names are tricky, especially when you have to consider names which have been romanized: there are over four hundred different ways of spelling Moammar Khadaffi (in the roman alphabet) - and not even Google can agree on which variant is the most canonical.

以我的经验,将令牌(名字,姓氏)串联起来是一种喜忧参半的祝福.它解决了某些问题(例如道路名称是出现在地址行1还是地址行2中),但会引起其他问题:考虑对GRAHAM OLIVER和OLIVER GRAHAM评分,而不是OLIVER对OLIVER,GRAHAM对GRAHAM,OLIVER对GRAHAM和GRAHAM对OLIVER评分.

In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.

无论您做什么,最终还是会得到误报和错失命中率.没有任何算法可以防止打字错误(尽管Jaro Winkler在MARX vs AMRX方面做得很好).

Whatever you do you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty good with MARX vs AMRX).

这篇关于使用Soundex,Jaro Winkler和Edit Distance(UTL_MATCH)匹配Oracle复制列值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆