Filtering out duplicated/non-unique rows in data.table

Views: 73 | Posted: 2017/3/12 10:07:29 | Tags: r, data.table, duplicate-removal


Problem description

 
I have a data.table with about 2.5 million rows and two columns. I want to remove all rows that are duplicated across both columns. Previously, for a data.frame, I would have done this:
df <- unique(df[, c('V1', 'V2')])
but this doesn't work with data.table. I have tried
unique(df[, c("V1", "V2"), with = FALSE])
but it still seems to operate only on the key of the data.table and not on the whole row.

Any suggestions?

Cheers,
Davy

Example
>dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B
In the above data.table, where V2 is the table key, only rows 4, 7, and 10 should be removed.
> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", 
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", 
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

Solution
From ?unique.data.table, it is clear that calling unique on a data.table considers only the key columns. This means you have to reset the key to all columns before calling unique.
library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
Calling unique with one column as key:
setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G
Reset the key to all columns, then call unique:
setkey(dt)
unique(dt)
     V1 V2
[1,]  A  B
[2,]  A  C
[3,]  A  D
[4,]  B  A
[5,]  C  D
[6,]  E  F
[7,]  G  G
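
The key-dependent behaviour above can also be spelled out with the `by` argument of `unique()`, which names the columns to compare regardless of the key. The following is a minimal sketch using the same `dt`; the names `by_v2` and `by_both` are only illustrative. As far as I know, data.table releases from 1.9.8 onward compare all columns by default, so treat that version detail as an assumption and check `?unique.data.table` for your installation.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

# Compare on V2 only: one row per distinct V2 value
# (mirrors the result obtained with the key set to V2).
by_v2 <- unique(dt, by = "V2")

# Compare on both columns: one row per distinct (V1, V2) pair.
by_both <- unique(dt, by = c("V1", "V2"))

nrow(by_v2)    # 6
nrow(by_both)  # 7
```

Passing `by` explicitly avoids touching the key at all, which may matter if other code relies on the table staying keyed.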




Edit from Matthew:

Alternatively, instead of setting the key to all columns, which might take some time for large tables with many rows and many columns, removing the key achieves the same result:
setkey(dt,NULL)
unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G
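
To check which rows get dropped (rows 4, 7 and 10 in the question's example), `duplicated()` can be called with the same kind of `by` argument. This is a small sketch on the unkeyed table; `is_dup`, `dup_rows` and `deduped` are illustrative names, not part of the original answer.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

# TRUE for every row that repeats an earlier (V1, V2) pair.
is_dup <- duplicated(dt, by = c("V1", "V2"))

# Positions of the rows that would be removed.
dup_rows <- which(is_dup)   # 4 7 10

# Keep only the first occurrence of each pair.
deduped <- dt[!is_dup]
nrow(deduped)               # 7
```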


                        