R - simple Record Linkage - the next step?

Question

I am trying to do some simple direct linkage with the RecordLinkage package (library('RecordLinkage')).

So I just have a vector:

tv3 = c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", 
"TOURDE FRANZ", "GET FRESH") 

The function that I need is compare.dedup from the RecordLinkage package, and I get:

compare.dedup(as.data.frame(tv3))$pairs

$pairs
id1 id2 tv3 is_match
1    1   2   1       NA
2    1   3   0       NA
3    1   4   0       NA
4    1   5   0       NA
5    2   3   0       NA
....

I have trouble finding documentation for the next step. How do I then compare the strings and find my similar pairs?

So I found the string comparison function jarowinkler(), but it only compares pairs of strings. Basically, you can only call jarowinkler(tv3[1], tv3) for one element at a time.

So I am asking: do you need to write your own loop to get the result, or is there a more direct way starting from the compare.dedup function?

library(RecordLinkage)

# Jaro-Winkler score for every pair of strings in tv3
n <- length(tv3)
mat <- matrix(0, n, n)

for (j in seq_len(n)) {
  for (i in seq_len(n)) {
    mat[i, j] <- jarowinkler(tv3[j], tv3[i])
  }
}

The resulting matrix (jarowinkler() returns similarity scores, so 1 means identical strings):

> mat
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 1.0000000 0.9846154 0.9333333 0.5240741
[2,] 1.0000000 1.0000000 0.9846154 0.9333333 0.5240741
[3,] 0.9846154 0.9846154 1.0000000 0.9525641 0.5133903
[4,] 0.9333333 0.9333333 0.9525641 1.0000000 0.5240741
[5,] 0.5240741 0.5240741 0.5133903 0.5240741 1.0000000
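
The double loop above can probably be avoided. As a minimal sketch, assuming jarowinkler() is vectorized over its two arguments (which the one-against-many call jarowinkler(tv3[1], tv3) suggests), base R's outer() can build the same matrix in one call. The dimnames line is only an illustrative convenience.

```r
library(RecordLinkage)

tv3 <- c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
         "TOURDE FRANZ", "GET FRESH")

# outer() expands both vectors to all combinations and calls
# jarowinkler() once on the expanded vectors
mat <- outer(tv3, tv3, jarowinkler)

# Label rows and columns with the strings for easier reading
dimnames(mat) <- list(tv3, tv3)

round(mat, 3)
```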

What I want to do is simply assign to the similar objects ("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE", "TOURDE FRANZ") one of the possible similar names.

How could I set a cut-off, let's say 0.90, on my similarity matrix and then retrieve all the rows of the similar objects?

Suppose my data are in a data frame:

             tv3
1  TOURDEFRANCE
2  TOURDEFRANCE
3 TOURDE FRANCE
4  TOURDE FRANZ
5     GET FRESH

Can I do something like which(score > 0.90) and retrieve the corresponding rows?
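
One possible way to apply such a cut-off, sketched with base R only. The matrix values are copied from the output shown in the question. Grouping by single-linkage clustering on 1 - similarity is an assumption made for this illustration, not something the RecordLinkage package prescribes, and the names cutoff, similar, groups and canonical are hypothetical.

```r
# Strings and similarity scores copied from the question
tv3 <- c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
         "TOURDE FRANZ", "GET FRESH")
mat <- matrix(c(
  1.0000000, 1.0000000, 0.9846154, 0.9333333, 0.5240741,
  1.0000000, 1.0000000, 0.9846154, 0.9333333, 0.5240741,
  0.9846154, 0.9846154, 1.0000000, 0.9525641, 0.5133903,
  0.9333333, 0.9333333, 0.9525641, 1.0000000, 0.5240741,
  0.5240741, 0.5240741, 0.5133903, 0.5240741, 1.0000000
), nrow = 5, byrow = TRUE)

cutoff <- 0.90

# Index pairs (row < col) whose similarity exceeds the cut-off
similar <- which(mat > cutoff & upper.tri(mat), arr.ind = TRUE)

# Rows similar to the first string (includes row 1 itself)
tv3[which(mat[1, ] > cutoff)]

# Group strings: cluster on distance = 1 - similarity, cut at 1 - cutoff
groups <- cutree(hclust(as.dist(1 - mat), method = "single"),
                 h = 1 - cutoff)

# Use the first member of each group as its representative name
canonical <- tv3[match(groups, groups)]

data.frame(tv3, group = groups, canonical)
```

With single linkage, a string joins a group as soon as it is close enough to any one member, so chains of near matches end up together. Complete linkage would be stricter.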

Any help with this simple Record Linkage problem is very welcome!

Recommended answer

Taken from this post, here's an example that should work for you:

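library(RecordLinkage)
library(magrittr)  # provides the %>% pipe

tv3 = as.data.frame(c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", 
    "TOURDE FRANZ", "GET FRESH"))
colnames(tv3) <- "name"

# Compare all pairs with a string comparator, compute weights,
# classify pairs with weight >= 0.5 as links, and keep only the links
tv3 %>% compare.dedup(strcmp = TRUE) %>%
        epiWeights() %>%
        epiClassify(0.5) %>%
        getPairs(show = "links", single.rows = TRUE) -> matches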

As a result, the matches data frame should help you determine a suitable threshold (set in epiClassify()).
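
To choose a threshold, it may help to look at the weight of every candidate pair before classifying. The following is a sketch under the assumption that the object returned by epiWeights() stores one weight per pair in its Wdata component, aligned with the rows of its pairs component. The name scored is hypothetical.

```r
library(RecordLinkage)

tv3 <- data.frame(
  name = c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
           "TOURDE FRANZ", "GET FRESH"),
  stringsAsFactors = FALSE
)

# Compare all pairs with a string comparator, then compute weights
rpairs <- epiWeights(compare.dedup(tv3, strcmp = TRUE))

# One row per candidate pair, with its weight and both original strings
scored <- data.frame(rpairs$pairs[, c("id1", "id2")],
                     weight = rpairs$Wdata)
scored$name1 <- tv3$name[scored$id1]
scored$name2 <- tv3$name[scored$id2]

# Highest weights first: look for the gap between matches and non-matches
scored[order(-scored$weight), ]
```

A value in the gap between the highest non-match weight and the lowest match weight can then be passed to epiClassify().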
