在几列中查找唯一的一组标识符/组 [英] Find unique set of identifiers/groups among several columns

查看:72
本文介绍了在几列中查找唯一的一组标识符/组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的数据带有两列(可能更多)标识符(通常是长字符串)。这些有时会有所不同,输入错误或随时间变化。我想识别数据中的唯一主题。这需要识别通过其ID在某种程度上相关联的案例组。

I have data with two (potentially more) columns of identifiers (typically long strings). These differ sometimes, are mistyped, or change over time. I want to identify unique subjects in the data. This requires identifying groups of cases which are connected via their ids at some level.

示例

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c(1, 1, 3, 4, 4, 7),
                 trueid = c("id1", "id1", "id2", "id2", "id2", "id3"))



> df
  ida idb trueid
1   A   1    id1
2   B   1    id1
3   C   3    id2
4   C   4    id2
5   D   4    id2
6   E   7    id3

id1 是 A, B,1 ,表示 id2 C, D,3、4 id3 E,7

我不知道 trueid ,但需要使用 ida列中的信息进行查找 idb

I don't know trueid but need to find it using the information from columns ida and idb.

该解决方案需要扩展到数以十万计的数百万个观测值数千个唯一ID。我已经在使用 data.table

The solution needs to scale to millions of observations with tens of thousands of unique ids. I am already using data.table.

扩展:还有另一种情况有两列以上,有些列可能对其他列有帮助,即具有相同的标识符。我不知道哪些列对哪些内容有用。我认为类型可以忽略,但是所有列都是字符串或可以安全地转换。

Extension: There is another scenario where there are more than two columns, and some columns might be informative for others, i.e. have the same identifiers. I do not know which columns are informative for which. I think type can be disregarded though, all columns are strings or can safely be converted.

另一个示例:

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c("1", "2", "3", "4", "4", "7"),
                 idc = c("1", "1", "2", "3", "4", "5"),
                 idd = c("1", "A", "2", "3", "4", "5"),
                 trueid = c("id1", "id1", "id1", "id1", "id1", "id2"))



> df
  ida idb idc idd trueid
1   A   1   1   1    id1
2   B   2   1   A    id1
3   C   3   2   2    id1
4   C   4   3   3    id1
5   D   4   4   4    id1
6   E   7   5   5    id2

编辑:正如评论者所指出的,这本质上是在图中找到完整子图的问题。阅读更多内容后,我知道可以使用 library(igraph)解决此问题。我将问题悬而未决,因为我希望使用依赖 base data.table 或<$ c的解决方案$ c> dplyr 。我无法轻松地在正在使用的服务器上安装软件包,安装 igraph 涉及很多繁琐的手续和延误。

As a commenter pointed out, this is essentially the clique problem of finding complete subgraphs in a graph. After reading a bit more, I understand this issue can be solved with library(igraph). I am leaving the question open as I would prefer a solution that relies on base, data.table or dplyr. I cannot easily install packages on the server I am using, installing igraph involves dealing with a lot of red tape and delays.

Edit2:对于阅读此书并面临类似问题的任何人: zx8754 使用igraph的答案要快得多(几个数量级)在具有更多组的更大(模拟)数据上。如果您有机会使用 igraph ,请这样做。

For anybody reading this and facing a similar problem: zx8754's answer using igraph is considerably (several orders of magnitude) faster on larger (simulated) data with more groups. If you have the chance to use igraph, do so.

推荐答案

这是使用 data.table 的递归方法:

#convert into a long format for easier processing
mDT <- melt(DT[, rn := .I], id.var="rn", variable.name="V", value.name="ID")[,
    tid := NA_integer_]

#the recursive function
link <- function(ids, label) {
    #identify the rows in DT containing ids and extract the IDs
    newids <- mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), allow.cartesian=TRUE,
        unique(ID)]

    #update those rows to the same group
    mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), tid := label]

    if (length(setdiff(newids, ids)) > 0L) {
        #call the recursive function if there are new ids
        link(newids, label)
    }
}

#get the first id that is not labelled yet
id <- mDT[is.na(tid), ID[1L]]
grp <- 1L
while(!is.na(id)) {
    #use recursive function to link them up
    link(id, grp)

    #repeat for next id that is not part of any group yet
    id <- mDT[is.na(tid), ID[1L]]
    grp <- grp + 1L
}

#update original DT with tid
DT[mDT, on=.(rn), tid := tid]

数据:

library(data.table)
DT <- data.table(ida = c("A", "B", "C", "C", "D", "E"),
    idb = c("1", "2", "3", "4", "4", "7"),
    idc = c("1", "1", "2", "3", "4", "5"),
    idd = c("1", "A", "2", "3", "4", "5"))

这篇关于在几列中查找唯一的一组标识符/组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆