R RecordLinkage Identity

Question

I am working with the RecordLinkage library in R. I have a data frame with id, name, phone, mail.

My code looks like this:

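library(RecordLinkage)
ids <- data$id
# list(2, 3, 4) blocks on each column in turn; as.list(2, 3, 4) would keep only the first argument
pairs <- compare.dedup(data, identity = ids, blockfld = list(2, 3, 4))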

The problem is that the ids in my result output are not my original ids. For example, if I have this data:

id   Name     Phone    Mail
233  Nathali  2222     nathali@dd.com
435  Nathali  2222 
553  Jean     3444     jean@dd.com

In my result output I get something like

id1 id2
1   2

instead of

id1 id2
233 435 

I want to know if there is a way to keep the ids instead of the row index, or whether someone could explain the identity parameter to me.

Thanks

Recommended answer

The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage: normally you have a number of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method, or want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.

In your example, the first two rows (ids 233, 435) obviously refer to the same person and the third row to a different one. A meaningful identity vector would therefore be:

c(1,1,2)

But it could just as well be:

c(42,42,128)

Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
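To make the role of the identity vector more concrete, here is a minimal base R sketch of how true match status can be derived from it for every record pair. It only illustrates the idea and is not the package's internal implementation; the names `identity`, `idx` and `pairs` are chosen for this example.

```r
# Identity vector for the three example rows: rows 1 and 2 are the same person
identity <- c(1, 1, 2)

# Enumerate all unordered pairs of row indices: (1,2), (1,3), (2,3)
idx <- t(combn(seq_along(identity), 2))
pairs <- data.frame(id1 = idx[, 1], id2 = idx[, 2])

# A pair is a true match exactly when both rows carry the same identity value
pairs$is_match <- identity[pairs$id1] == identity[pairs$id2]

print(pairs)
```

Using c(42, 42, 128) instead of c(1, 1, 2) gives the same match status, because only equality between the values matters.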

About your question on how to display the ids in the result: you can get the full record pairs, including all data fields, with (see the documentation for more details):

getPairs(pairs)

There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
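One simple option, sketched here in base R, is to use the id1/id2 values as row indices into the original data frame and look up the id column. The `index_pairs` data frame below is a hand-built stand-in for the index columns of the comparison result; that these live in `pairs$pairs` with columns `id1` and `id2` is an assumption you should check against your own object.

```r
# Example data from the question
data <- data.frame(
  id    = c(233, 435, 553),
  Name  = c("Nathali", "Nathali", "Jean"),
  Phone = c("2222", "2222", "3444"),
  Mail  = c("nathali@dd.com", NA, "jean@dd.com"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the index columns of the result (row 1 paired with row 2)
index_pairs <- data.frame(id1 = 1L, id2 = 2L)

# Translate row indices back to the original ids
original_ids <- data.frame(
  id1 = data$id[index_pairs$id1],
  id2 = data$id[index_pairs$id2]
)

print(original_ids)
```

This relies on the row order of `data` being unchanged between the comparison step and the lookup.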

P.S.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have gone unanswered for a long time. I will look for a way to get notified of new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
