R: Checking if a set of variables forms a unique index

Views: 180 Published: 2017/3/12 11:11:55 Tags: r, data.table

Problem description

I have a large data frame, and I want to check whether the values of a set of (factor) variables uniquely identify each row of the data.

My current strategy is to aggregate by the variables that I think are the index variables and check that no combination occurs more than once:

dfAgg = aggregate(dfTemp$var1, by = list(dfTemp$var1, dfTemp$var2, dfTemp$var3), FUN = length)
stopifnot(sum(dfAgg$x > 1) == 0)

But this strategy takes forever.

Thanks.

Recommended answer

The data.table package provides very fast duplicated and unique methods for data.tables. Both take a by= argument that names the columns on which the duplicated/unique result should be computed.
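As a minimal sketch of what the by= argument does, the toy table below (its column names id, grp and val are made up for illustration) has one repeated (id, grp) pair. It assumes the data.table package is installed.

```r
library(data.table)

# Toy table: the (id, grp) pair (1, "a") appears in rows 1 and 3.
dt <- data.table(id  = c(1L, 2L, 1L, 3L),
                 grp = c("a", "a", "a", "b"),
                 val = c(10, 20, 30, 40))

# Logical vector flagging rows whose (id, grp) pair appeared in an earlier row.
dup_flags <- duplicated(dt, by = c("id", "grp"))

# One row per distinct (id, grp) pair; the first occurrence is kept.
first_rows <- unique(dt, by = c("id", "grp"))

# (id, grp) is a unique index only if no row is flagged.
is_unique_index <- !any(dup_flags)
```

Here only row 3 is flagged, so (id, grp) does not uniquely identify the rows of this table.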

Here's an example on a large data.frame:

require(data.table)
set.seed(45L)
## use setDT(dat) if your data is a data.frame, 
## to convert it to a data.table by reference
dat <- data.table(var1=sample(100, 1e7, TRUE), 
                 var2=sample(letters, 1e7, TRUE), 
                 var3=sample(as.numeric(sample(c(-100:100, NA), 1e7,TRUE))))

system.time(any(duplicated(dat)))
#  user  system elapsed
# 1.632   0.007   1.671

With anyDuplicated.data.frame, the same check takes 25 seconds.
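For reference, the base R comparison might look like the sketch below. It uses only base R; the small dfTemp here is a made-up stand-in for the data frame in the question, not the benchmark data.

```r
# Small stand-in for dfTemp from the question; rows 1 and 3 share (var1, var2).
dfTemp <- data.frame(var1 = c(1L, 2L, 1L),
                     var2 = c("a", "a", "a"),
                     var3 = c(0.1, 0.2, 0.3))

# anyDuplicated() returns 0 if no row repeats,
# otherwise the index of the first repeated row.
first_dup <- anyDuplicated(dfTemp[c("var1", "var2")])

# The candidate columns form a unique index only if the result is 0.
two_cols_unique  <- first_dup == 0L
all_three_unique <- anyDuplicated(dfTemp[c("var1", "var2", "var3")]) == 0L
```

This already avoids the aggregate() call from the question, but on large data the data.table methods timed in this answer are considerably faster.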

# if you want to calculate based on just var1 and var2
system.time(any(duplicated(dat, by=c("var1", "var2"))))
#  user  system elapsed
# 0.492   0.001   0.495

With anyDuplicated.data.frame, this takes 7.4 seconds.
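To turn the check into a reusable assertion, something like the following sketch could work. The helper names is_unique_index and duplicate_keys are hypothetical, not part of data.table; the sketch relies on data.table's anyDuplicated method, which also accepts by=, and on the small made-up table below.

```r
library(data.table)

# Hypothetical helper: TRUE if the columns in `cols` uniquely identify every row.
is_unique_index <- function(dt, cols) {
  stopifnot(is.data.table(dt), all(cols %in% names(dt)))
  # anyDuplicated() returns 0 when no key combination repeats.
  anyDuplicated(dt, by = cols) == 0L
}

# Hypothetical helper: key combinations that occur more than once, with counts.
duplicate_keys <- function(dt, cols) {
  dt[, .N, by = cols][N > 1L]
}

dat <- data.table(var1 = c(1L, 1L, 2L, 2L),
                  var2 = c("a", "b", "a", "a"),
                  var3 = c(0.5, 0.5, NA, 1.5))

two_cols   <- is_unique_index(dat, c("var1", "var2"))
three_cols <- is_unique_index(dat, c("var1", "var2", "var3"))
offenders  <- duplicate_keys(dat, c("var1", "var2"))
```

In this toy table (var1, var2) repeats once, so two_cols is FALSE and offenders lists the repeated pair, while adding var3 makes the key unique. Wrapping the call as stopifnot(is_unique_index(dat, cols)) reproduces the assertion style used in the question.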
