R: Checking if a set of variables forms a unique index
Problem description
I have a large data frame and I want to check whether the values of a set of (factor) variables uniquely identify each row of the data.
My current strategy is to aggregate by the variables that I think are the index variables:
dfAgg = aggregate(dfTemp$var1, by = list(dfTemp$var1, dfTemp$var2, dfTemp$var3), FUN = length)
stopifnot(sum(dfAgg$x > 1) == 0)
But this strategy takes forever.
Thanks.
Recommended answer
The data.table package provides very fast duplicated and unique methods for data.tables. These methods also take a by= argument where you can provide the columns on which the duplicated/unique results should be computed.
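To make the by= behaviour concrete, here is a minimal sketch on a small toy table. The column names id, grp, and val and the variable names are hypothetical, chosen only for illustration; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data: (id, grp) is unique per row, id alone is not
dt <- data.table(id  = c(1, 1, 2, 2),
                 grp = c("a", "b", "a", "b"),
                 val = c(10, 20, 30, 40))

# TRUE marks rows whose id value already appeared in an earlier row
dup_id <- duplicated(dt, by = "id")

# Same check, but on the combination of id and grp
dup_id_grp <- duplicated(dt, by = c("id", "grp"))

# A set of columns forms a unique index when no row is flagged
id_is_key     <- !any(dup_id)
id_grp_is_key <- !any(dup_id_grp)
```

Here id_is_key is FALSE because each id value occurs twice, while id_grp_is_key is TRUE because every (id, grp) pair occurs only once.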
Here's an example of a large data.frame:
require(data.table)
set.seed(45L)
## use setDT(dat) if your data is a data.frame,
## to convert it to a data.table by reference
dat <- data.table(var1 = sample(100, 1e7, TRUE),
                  var2 = sample(letters, 1e7, TRUE),
                  var3 = sample(as.numeric(sample(c(-100:100, NA), 1e7, TRUE))))
system.time(any(duplicated(dat)))
# user system elapsed
# 1.632 0.007 1.671
Using anyDuplicated.data.frame takes 25 seconds.
# if you want to calculate based on just var1 and var2
system.time(any(duplicated(dat, by=c("var1", "var2"))))
# user system elapsed
# 0.492 0.001 0.495
Using anyDuplicated.data.frame takes 7.4 seconds.
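For comparison, the base R check behind those slower timings can be sketched as follows. This is only an illustration: the helper name is_unique_index and the toy data are hypothetical, and the function assumes a plain data.frame rather than a data.table.

```r
# Hypothetical helper: TRUE when the columns named in `cols`
# uniquely identify each row of the plain data.frame `df`
is_unique_index <- function(df, cols) {
  # anyDuplicated() returns 0 when no row repeats,
  # otherwise the index of the first duplicated row
  anyDuplicated(df[cols]) == 0L
}

# Toy data for illustration only
df <- data.frame(var1 = c(1, 1, 2),
                 var2 = c("a", "b", "a"))

both_cols_unique <- is_unique_index(df, c("var1", "var2"))
var1_unique      <- is_unique_index(df, "var1")
```

With this toy data, both_cols_unique is TRUE and var1_unique is FALSE, since the value 1 appears twice in var1. The result can be passed straight to stopifnot() in place of the aggregate-based check from the question.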