Porting set operations from R's data frames to data tables: How to identify duplicated rows?


Question

[Update 1: As Matthew Dowle noted, I'm using data.table version 1.6.7 on R-Forge, not CRAN. You won't see the same behavior with an earlier version of data.table.]

As background: I am porting some little utility functions to do set operations on rows of a data frame or pairs of data frames (i.e. each row is an element in a set), e.g. unique - to create a set from a list, union, intersection, set difference, etc. These mimic Matlab's intersect(...,'rows'), setdiff(...,'rows'), etc., which don't appear to have counterparts in R (R's set operations are limited to vectors and lists, but not rows of matrices or data frames). Examples of these little functions are below. If this functionality for data frames already exists in some package or base R, I'm open to suggestions.

I have been migrating these to data tables and one necessary step in the current approach is to find duplicated rows. When duplicated() is executed an error is returned stating that data tables must have keys. This is an unfortunate roadblock - other than setting keys, which isn't a universal solution and adds to computational costs, is there some other way to find duplicated objects?

Here is a reproducible example:

library(data.table)
set.seed(0)
x   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
y   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))

res3    <- dt_intersect(x,y)

This produces the following error message:

Error in duplicated.data.table(z_rbind) : data table must have keys

The code works as-is for data frames, though I've named each function with the pattern dt_operation.

Is there some way to get around this issue? Setting keys only works for integers, which is a constraint I can't assume for the input data. So, perhaps I'm missing a clever way to use data tables?
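For comparison, here is a minimal base R sketch of the behavior being relied on with data frames: duplicated() and unique() compare whole rows and need no keys, whatever the column types. The data frame df and its columns are made up for illustration.

```r
# Hypothetical data frame with character and numeric columns;
# rows 1 and 3 are identical.
df <- data.frame(
  id    = c("a", "b", "a", "c"),
  score = c(1.5, 2.0, 1.5, 2.0),
  stringsAsFactors = FALSE
)

# duplicated() marks a row TRUE when it repeats an earlier row
# across all columns.
dupe_flags <- duplicated(df)
print(dupe_flags)   # flags: FALSE FALSE TRUE FALSE

# unique() keeps the first occurrence of each distinct row.
distinct_rows <- unique(df)
print(distinct_rows)
```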

Example set operation functions, where the elements of the sets are rows of data:

dt_unique       <- function(x){
    return(unique(x))
}

dt_union        <- function(x,y){
    z_rbind     <- rbind(x,y)
    z_unique    <- dt_unique(z_rbind)
    return(z_unique)
}

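dt_intersect    <- function(x,y){
    # De-duplicate each input so a row can repeat only if it is in both
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    # After stacking, a duplicated row is a row of zx that also occurs in zy
    z_rbind     <- rbind(zy,zx)
    ixDupe      <- which(duplicated(z_rbind))
    z           <- z_rbind[ixDupe,]
    return(z)
}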

dt_setdiff      <- function(x,y){
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    # Stack y first, so rows of x occupy the tail of z_rbind
    z_rbind     <- rbind(zy,zx)
    ixRangeX    <- (nrow(zy) + 1):nrow(z_rbind)
    # Rows of x that are not duplicates of any row of y
    ixNotDupe   <- which(!duplicated(z_rbind))
    ixDiff      <- intersect(ixNotDupe, ixRangeX)
    diffX       <- z_rbind[ixDiff,]
    return(diffX)
}
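As a worked illustration of the stack-then-duplicated() pattern used above, here is a small sketch on plain data frames. The inputs x and y are invented toy data, and the steps are written inline rather than through the helper functions so that the block runs on its own.

```r
# Toy inputs: x contains a repeated row, and one row, (2, "q"), is shared with y.
x <- data.frame(a = c(1, 1, 2, 3), b = c("p", "p", "q", "r"),
                stringsAsFactors = FALSE)
y <- data.frame(a = c(2, 3, 4), b = c("q", "z", "s"),
                stringsAsFactors = FALSE)

zx <- unique(x)            # (1,p), (2,q), (3,r)
zy <- unique(y)            # (2,q), (3,z), (4,s)
z_rbind <- rbind(zy, zx)   # rows of y first, then rows of x

# Intersection: the only rows that can be duplicated are rows of zx
# that already appeared in zy.
in_both <- z_rbind[duplicated(z_rbind), ]

# Set difference x \ y: non-duplicated rows that lie in the x part of the stack.
ix_range_x  <- (nrow(zy) + 1):nrow(z_rbind)
ix_not_dupe <- which(!duplicated(z_rbind))
only_in_x   <- z_rbind[intersect(ix_not_dupe, ix_range_x), ]

print(in_both)     # the row (2, "q")
print(only_in_x)   # the rows (1, "p") and (3, "r")
```

Stacking y before x matters for the set difference: a row of x that also occurs in y is then flagged as the duplicate, so it drops out of the x range.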

Note 1: One intended use for these helper functions is to find rows where key values in x are not among the key values in y. This way, I can find where NAs may appear when calculating x[y] or y[x]. Although this usage allows for setting of keys for the z_rbind object, I'd prefer not to constrain myself to just this use case.

Note 2: For related posts, here is a post on running unique on data frames, with excellent results for running it with the updated data.table package. And this is an earlier post on running unique on data tables.

Recommended answer

duplicated.data.table needs the same fix that unique.data.table got. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching, you're already using v1.6.7 from R-Forge, not 1.6.6 on CRAN.
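For readers on a much newer data.table than the versions discussed here, the following sketch shows how this looks today. It assumes a release in which duplicated() compares all columns by default and in which the row-wise set functions fintersect(), fsetdiff() and funion() are available (to my recollection these arrived around 1.9.8, but check the NEWS file for your installed version). The tables x and y are toy data.

```r
library(data.table)

# Toy inputs with no key set; column types match between x and y.
x <- data.table(a = c(1, 1, 2, 3), b = c("p", "p", "q", "r"))
y <- data.table(a = c(2, 3, 4),    b = c("q", "z", "s"))

# Assumption: recent releases compare whole rows here, so no key is needed.
dupes <- duplicated(rbind(x, y))

# Row-wise set operations provided by data.table itself
# (default all = FALSE returns distinct rows).
common <- fintersect(x, y)   # rows present in both x and y
only_x <- fsetdiff(x, y)     # rows of x that are not in y
either <- funion(x, y)       # distinct rows of x and y combined

print(common)
print(only_x)
print(either)
```

If these functions are available, they cover the same ground as the dt_intersect, dt_setdiff and dt_union helpers in the question.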

But, on Note 1, there's a 'not join' idiom:

x[-x[y,which=TRUE]]

See also FR#1384 (New 'not' and 'whichna' arguments?), which would make this easier for users. It links to the "keys that don't match" thread, which goes into more detail.

Update: as of v1.8.3, not-join has been implemented.

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.
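A small runnable sketch of both idioms, plus the Note 1 use case of finding rows of x whose key values are absent from y. The tables DT, x and y and their columns are invented for illustration, and the on= argument in the last step assumes a data.table release newer than the one in this answer.

```r
library(data.table)

# Keyed toy table; setting the key sorts it by id: a, a, b, c.
DT <- data.table(id = c("a", "b", "a", "c"), v = 1:4, key = "id")

# Old idiom: find the matching row numbers, then drop them.
# nomatch = 0 keeps NA out of the index (newer releases also accept NULL).
old_way <- DT[-DT["a", which = TRUE, nomatch = 0]]

# Not-join: same rows, without the intermediate index.
new_way <- DT[!"a"]

# Note 1 use case: rows of x whose id has no match in y (an anti-join).
x <- data.table(id = c("a", "b", "c"), vx = 1:3)
y <- data.table(id = c("b", "c", "d"), vy = 4:6)
x_not_in_y <- x[!y, on = "id"]

print(old_way)
print(new_way)
print(x_not_in_y)
```

One caveat with the old idiom: when nothing matches, the index is empty and negating it selects no rows at all, whereas the not-join form returns every row.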
