Filtering out duplicated/non-unique rows in data.table


Problem description

I have a data.table with about 2.5 million rows and two columns. I want to remove any rows that are duplicated across both columns. Previously, for a data.frame, I would have done this: df <- unique(df[, c('V1', 'V2')]), but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]), but it still seems to operate only on the key of the data.table and not on the whole row.

Any suggestions?

Cheers, Davy

Example

> dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

In the above data.table, where V2 is the table key, only rows 4, 7, and 10 should be removed.

> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", 
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", 
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

Solution

According to ?unique.data.table, calling unique on a keyed data.table compares only the key columns, not the whole row. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

Reset the key to all columns, then call unique:

setkey(dt)
unique(dt)
     V1 V2
[1,]  A  B
[2,]  A  C
[3,]  A  D
[4,]  B  A
[5,]  C  D
[6,]  E  F
[7,]  G  G
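
To check which rows count as duplicates, the following sketch flags them with duplicated() on the unkeyed example table. It passes by explicitly, so the result should not depend on the key. The name dup_rows is introduced here for illustration and is not part of the original answer.

```r
library(data.table)

# Rebuild the example table from the answer (no key set).
dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Flag rows whose (V1, V2) pair already appeared earlier in the table.
dup_rows <- which(duplicated(dt, by = c("V1", "V2")))
dup_rows

# Keep only the first occurrence of each (V1, V2) pair.
dt[!duplicated(dt, by = c("V1", "V2"))]
```

On the example data this flags rows 4, 7 and 10, which are the rows the question expects to be removed.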


Edit from Matthew:

Alternatively, instead of setting the key to all columns, which might take some time for large tables with many rows and columns, removing the key achieves the same result:

setkey(dt,NULL)
unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G
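
In data.table 1.9.8 and later, unique() compares all columns by default, and the columns to compare can be given through the by argument, so changing the key should not be necessary. A minimal sketch, assuming such a version is installed; res_all and res_v2 are illustrative names, not from the original answer.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)
setkey(dt, V2)  # keyed on V2, as in the question

# Deduplicate on both columns, regardless of the key.
res_all <- unique(dt, by = c("V1", "V2"))

# Deduplicate on V2 only (what the older key-based default did here).
res_v2 <- unique(dt, by = "V2")

nrow(res_all)
nrow(res_v2)
```

Naming the columns in by keeps the intent explicit and leaves the existing key in place for later joins.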
