Setting explicit rules for matching records using Python Dedupe library

Question

I'm using the Dedupe library to match person records to each other. My data includes name, date of birth, address, phone number and other personally identifying information.

Here is my question: I always want to match two records with 100% confidence if they have a matching name and phone number (for example).

Here is an example of some of my code:

fields = [
    {'field': 'LAST_NM', 'variable name': 'last_nm', 'type': 'String'},
    {'field': 'FRST_NM', 'variable name': 'frst_nm', 'type': 'String'},
    {'field': 'FULL_NM', 'variable name': 'full_nm', 'type': 'Name'},
    {'field': 'BRTH_DT', 'variable name': 'brth_dt', 'type': 'String'},
    {'field': 'SEX_CD', 'type': 'Exact'},
    {'field': 'FULL_US_ADDRESS', 'variable name': 'us_address', 'type': 'Address'},
    {'field': 'APT_NUM', 'type': 'Exact'},
    {'field': 'CITY', 'type': 'ShortString'},
    {'field': 'STATE', 'type': 'ShortString'},
    {'field': 'ZIP_CD', 'type': 'ShortString'},
    {'field': 'HOME_PHONE', 'variable name': 'home_phone', 'type': 'Exact'},
    {'type': 'Interaction', 'interaction variables': ['full_nm', 'home_phone']},
]

In the Dedupe library, is there any way for me to explicitly match two or more fields? According to the docs, "An interaction field multiplies the values of the multiple variables." (https://dedupe.readthedocs.org/en/latest/Variable-definition.html#interaction). I want to implement a strict rule that it matches with 100% confidence - not merely multiplying the values of the variables. The reason I ask is that I have found that occasionally Dedupe misses some matches on these two criteria (likely a result of me not training long enough, but regardless, I just want to hard code these matches into my script).

Any suggestions?

Recommended answer

Dedupe does not have this feature and probably never will (I'm one of the main authors). If it is truly a rule that an exact match on these fields means the records are co-referent, you can write some code to match those records explicitly before sending the rest of the records into Dedupe.

from collections import defaultdict

# Group record ids by the fields that must match exactly.
exact_matches = defaultdict(list)
for record_id, record in records.items():
    match_key = (record['name'], record['phone'])
    exact_matches[match_key].append(record_id)

# Keep one record per group for Dedupe; map the other ids to the kept one.
partially_deduplicated = []
exact_lookup = {}
for match_group in exact_matches.values():
    head_id = match_group.pop()
    partially_deduplicated.append((head_id, records[head_id]))
    for dupe_id in match_group:
        exact_lookup[dupe_id] = head_id
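
Once Dedupe has clustered the reduced record set, the ids set aside in exact_lookup still have to be re-attached to their head records. The following is a minimal sketch of that step, not part of the Dedupe API: expand_clusters is a hypothetical helper, and it assumes each cluster has already been reduced to a plain sequence of record ids (how you obtain that from Dedupe's output depends on the version you run). Note also that partially_deduplicated is a list of (id, record) pairs, so it would likely need to be passed through dict() before being handed to Dedupe.

```python
from collections import defaultdict


def expand_clusters(clusters, exact_lookup):
    """Re-attach exact-match duplicates to the cluster of their head record.

    clusters: iterable of sequences of record ids (assumed format).
    exact_lookup: dict mapping dupe_id -> head_id, as built above.
    """
    # Invert dupe_id -> head_id into head_id -> [dupe_id, ...].
    dupes_by_head = defaultdict(list)
    for dupe_id, head_id in exact_lookup.items():
        dupes_by_head[head_id].append(dupe_id)

    expanded = []
    clustered_heads = set()
    for cluster in clusters:
        members = []
        for record_id in cluster:
            members.append(record_id)
            members.extend(dupes_by_head.get(record_id, []))
            clustered_heads.add(record_id)
        expanded.append(sorted(members))

    # Heads that appear in no cluster still form a group with their
    # exact duplicates.
    for head_id, dupe_ids in dupes_by_head.items():
        if head_id not in clustered_heads:
            expanded.append(sorted([head_id] + dupe_ids))
    return expanded


# Hypothetical ids, for illustration only.
example_lookup = {"a2": "a1", "a3": "a1", "c2": "c1"}
example_clusters = [("a1", "b1")]
result = expand_clusters(example_clusters, example_lookup)
print(result)
```

Here "a1" and "b1" were matched by the trained model, while "a2", "a3" and "c2" were matched by the hard-coded rule; the final groups combine both sources, so the strict rule holds regardless of how the model scores those pairs.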
