Find *all* duplicated records in data.table (not all-but-one)


Question

If I understand correctly, the duplicated() method for data.table returns a logical vector that does not flag the first occurrence of a duplicated record. What is the best way to mark that first occurrence as well? With base::duplicated(), I solved this by OR-ing the result with the same call run in reverse order: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE)). But data.table::duplicated() does not accept fromLast=TRUE (I don't know why).

P.S. OK, here's a primitive example:

myDT <- fread(
"id,fB,fC
 1, b1,c1
 2, b2,c2
 3, b1,c1
 4, b3,c3
 5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]

Rows 1, 3 and 5 are all duplicates, but duplicated() flags only rows 3 and 5, whereas I need to mark all of them.

UPD. Important notice: the answer I accepted below works only for a keyed table. To find duplicate records considering all columns, you have to setkey all of those columns explicitly. So far I use the following workaround for that case:

# Flag repeats scanning forward, then backward, and combine the two
dups1 <- duplicated(myDT)
dups2 <- duplicated(myDT, fromLast = TRUE)
dups <- dups1 | dups2

Recommended answer

As of data.table version 1.9.8, the solution by eddi needs to be modified to be:

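# Flag repeats on the key columns; in a keyed (sorted) table duplicates are adjacent
dups <- duplicated(myDT, by = key(myDT))
# A row is also a duplicate when the row right after it is flagged
myDT[, fD := dups | c(tail(dups, -1), FALSE)]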
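
For reference, here is a self-contained sketch of that approach on the question's toy data. The table is rebuilt with data.table() rather than fread() purely for convenience, and the fD2 column is an illustrative alternative (a per-group row count), not part of the original answer.

```r
library(data.table)

# Toy data from the question, rebuilt with data.table() for convenience
myDT <- data.table(
  id = 1:5,
  fB = c("b1", "b2", "b1", "b3", "b1"),
  fC = c("c1", "c2", "c1", "c3", "c1")
)
setkeyv(myDT, c("fB", "fC"))  # sorts by the key, so equal keys become adjacent

# Approach from the answer: flag a row if it, or the row after it, repeats the key
dups <- duplicated(myDT, by = key(myDT))
myDT[, fD := dups | c(tail(dups, -1), FALSE)]

# Illustrative alternative (an assumption, not from the answer):
# flag every row whose key group has more than one member
myDT[, fD2 := .N > 1, by = key(myDT)]

print(myDT)
```

The shift-based version depends on the table being sorted by the key; the group-count version does not, at the cost of a grouping pass.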

because:

Changes in v1.9.8 (on CRAN 25 Nov 2016)

Potentially breaking changes

By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
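
The effect of this default change can be seen in a small sketch, assuming data.table 1.9.8 or later with the old-behaviour option not set. The names DT, keyCols and allDups are illustrative only. It also uses the fromLast argument, which the duplicated() method accepts in these versions.

```r
library(data.table)

DT <- data.table(
  id = 1:5,
  fB = c("b1", "b2", "b1", "b3", "b1"),
  fC = c("c1", "c2", "c1", "c3", "c1")
)
setkeyv(DT, c("fB", "fC"))

# New default: all columns are compared, and id makes every row distinct
anyDefault <- any(duplicated(DT))

# Passing by = key(DT) explicitly restricts the comparison to the key columns;
# combining the forward and backward scans marks every member of a duplicate group
keyCols <- key(DT)
allDups <- duplicated(DT, by = keyCols) |
  duplicated(DT, by = keyCols, fromLast = TRUE)

print(anyDefault)
print(DT[allDups])
```

Unlike the shift-based trick, the forward/backward combination does not depend on duplicates being adjacent, so it should also work on an unkeyed table when by is given column names directly.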
