Filtering out duplicated/non-unique rows in data.table

Problem description

Edit 2019: This question was asked prior to the changes made to data.table in November 2016. See the accepted answer below for both the current and previous methods.

I have a data.table with about 2.5 million rows and two columns. I want to remove any rows that are duplicated in both columns. Previously, for a data.frame, I would have done this: df <- unique(df[,c('V1', 'V2')]) but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it still seems to operate only on the key of the data.table and not on the whole row.

Any suggestions?

Cheers, Davy

Example

>dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

In the above data.table, where V2 is the table key, only rows 4, 7, and 10 would be removed.

> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", 
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", 
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

Solution

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default, all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G
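
As a cross-check on the result above, here is a minimal sketch (assuming data.table v1.9.8 or later, and the same unkeyed example data) that uses duplicated() to show which rows get dropped. The variable names dup and res are only illustrative.

```r
library(data.table)

# Same example data as in this answer (no key set)
dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# duplicated() flags each row that repeats an earlier row,
# comparing all columns by default in v1.9.8+
dup <- duplicated(dt)
which(dup)   # rows 4, 7 and 10, as described in the question

# Dropping the flagged rows is equivalent to unique(dt)
res <- dt[!dup]
nrow(res)    # 7 rows remain
```

Either form should give the same seven rows; unique(dt) is simply the shorter spelling.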

Or use the by argument to get unique combinations of specific columns (which is what keys were previously used for):

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G
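
The by argument also accepts more than one column name. The following is a small sketch under the same assumptions (v1.9.8+, same example data); by_v1 and by_both are hypothetical names chosen for illustration.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Keep the first row seen for each distinct value of V1
by_v1 <- unique(dt, by = "V1")

# Naming both columns is the same as the default here,
# because the table only has these two columns
by_both <- unique(dt, by = c("V1", "V2"))

nrow(by_v1)    # 5
nrow(by_both)  # 7
```

Note that unique() keeps the first occurrence within each group, so the values in the columns not named in by depend on row order.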

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data.table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G
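
The example above only shows the single-column key. A sketch of the step described earlier, resetting the key to all columns before calling unique, might look like the following. Using setkeyv(dt, names(dt)) is one way to key on every column and is an assumption here, not necessarily the exact call the answer intended; the same code also runs on current versions, where the key no longer affects unique.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Key on every column; this also sorts the table by V1, then V2
setkeyv(dt, names(dt))
key(dt)

# With all columns in the key, unique() compares whole rows
res <- unique(dt)
nrow(res)   # 7 rows remain
```

Because setting a key reorders the table, the result comes back sorted by V1 and V2 rather than in the original row order.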

