Assign unique ID per multiple columns of data table

Question

I would like to assign unique IDs to the rows of a data table based on the values of multiple columns. Let's consider a simple example:

library(data.table)       
DT = data.table(a=c(4,2,NA,2,NA), b=c("a","b","c","b","c"), c=1:5)

    a b c
1:  4 a 1
2:  2 b 2
3: NA c 3
4:  2 b 4
5: NA c 5

I'd like to generate IDs based on columns a and b, and I expect to get three IDs: the 2nd and 4th rows should share one ID, and the 3rd and 5th rows should share another.

I have seen two solutions, but each is slightly incomplete:

1) Solution one requires a sorted data table, which is very cumbersome if we need to generate IDs from many columns (in my real application, IDs are generated based on about ten columns). Can we replace the cumsum function so that no sorting is required?

DT$ID1 <- cumsum(!duplicated(DT[,1:2]))

2) Solution two ignores NA values, whereas I'd like to include NAs and assign a group ID to them:

DT <- transform(DT, ID2 = as.numeric(interaction(a,b, drop=TRUE)))

I would appreciate any suggestion on how to modify either of the solutions to generate the Expected_ID shown below.

    a b c ID1 ID2 Expected_ID
1:  4 a 1   1   1           1
2:  2 b 2   2   2           2
3: NA c 3   3  NA           3
4:  2 b 4   3   2           2
5: NA c 5   3  NA           3


Recommended answer

The idiomatic way:

DT[, g := .GRP, by=.(a,b)]

    a b c g
1:  4 a 1 1
2:  2 b 2 2
3: NA c 3 3
4:  2 b 4 2
5: NA c 5 3
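
The one-liner above can be tried end to end on the question's example data. The following is a minimal sketch, not part of the original answer; the vector name `id_cols` is a hypothetical helper introduced here to show that the grouping columns can be passed by name, which is convenient when there are about ten of them.

```r
library(data.table)

# Example data from the question
DT <- data.table(a = c(4, 2, NA, 2, NA),
                 b = c("a", "b", "c", "b", "c"),
                 c = 1:5)

# Hypothetical helper: the grouping columns, given as a character vector
id_cols <- c("a", "b")

# .GRP is data.table's group counter. Rows that share the same (a, b)
# combination receive the same value, and NA is grouped like any other value.
DT[, g := .GRP, by = id_cols]

print(DT)
```

No sorting is needed here, and the rows with `a = NA` receive their own group ID rather than an NA.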

There's reason to believe this won't be fast, but it turns out it's not too bad compared to competing approaches:

nv = 10
nu = 3
nr = 1e6

library(data.table)
set.seed(1)
DT = do.call(CJ, rep(list(seq_len(nu)), nv))[sample(1:.N, nr, replace=TRUE)]

cols = copy(names(DT))

# "idiomatic" .GRP
system.time(DT[, g := .GRP, by=cols])
#    user  system elapsed 
#    0.23    0.02    0.25 

# sort and count runs
oi = as.call(lapply(c("order", cols), as.name))
system.time(DT[eval(oi), go := rleidv(.SD, cols)])
#    user  system elapsed 
#     0.3     0.0     0.3

# paste 'em
system.time(DT[, gp := match(p <- do.call(paste, c(.SD, list(sep="_"))), unique(p)), .SDcols=cols])
#    user  system elapsed 
#    5.26    0.06    5.32 

# paste 'em, fact'em (@akrun's answer)
system.time(DT[, gpf := as.integer(factor(p <- do.call(paste, c(.SD, list(sep="_"))), levels = unique(p))), .SDcols=cols])
#    user  system elapsed 
#    4.74    0.08    4.82 

# check
identical(DT$g, DT$gp); identical(DT$g, DT$gpf)
uniqueN(DT, "g") == uniqueN(DT, c("g", "go"))

The rleidv way creates different group numbers, but produces the same grouping.
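
To see what "different numbers, same grouping" means, here is a small sketch on the question's five-row example rather than the benchmark data. It reuses the `DT[i, go := rleidv(.SD, cols)]` pattern from the benchmark; the row order is computed with `do.call(order, ...)`, which is a simplification chosen for this illustration and not the answer's own code.

```r
library(data.table)

DT <- data.table(a = c(4, 2, NA, 2, NA),
                 b = c("a", "b", "c", "b", "c"),
                 c = 1:5)
cols <- c("a", "b")

# IDs numbered by order of first appearance
DT[, g := .GRP, by = cols]

# Row order that sorts by the grouping columns (assumption: base order()
# is used here in place of the benchmark's constructed call)
ord <- do.call(order, unname(as.list(DT[, ..cols])))

# Visit rows in sorted order and number the runs of equal (a, b) values
DT[ord, go := rleidv(.SD, cols)]

print(DT)

# Same partition: adding go to g creates no new combinations, and vice versa
same_partition <- uniqueN(DT, by = "g") == uniqueN(DT, by = c("g", "go")) &&
  uniqueN(DT, by = "go") == uniqueN(DT, by = c("g", "go"))
print(same_partition)
```

The two ID columns need not match value for value, because `go` is numbered in sorted order while `g` is numbered in order of first appearance; the check only confirms that they group the rows identically.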

Increasing the size of the problem to nr = 5e7 raised the time to 8s for the .GRP approach and to 20s for the rleidv way; the other approaches caused R to hang on my system.

For anyone interested, more approaches can be found in the R FAQ "How to create a consecutive index based on a grouping variable in a dataframe".
