Filtering out duplicated/non-unique rows in data.table
Edit 2019: This question was asked prior to changes in data.table in November 2016; see the accepted answer below for both the current and previous methods.
I have a data.table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously, for a data.frame, I would have done this:
df <- unique(df[, c('V1', 'V2')])
but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it still seems to operate only on the key of the data.table and not the whole row.
Any suggestions?
Cheers, Davy
Example
>dt
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] A B
[5,] B A
[6,] C D
[7,] C D
[8,] E F
[9,] G G
[10,] A B
In the above data.table, where V2 is the table key, only rows 4, 7, and 10 would be removed.
> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C",
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F",
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")
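The dput output above cannot be pasted back into R as-is, because the .internal.selfref pointer is not valid R syntax. A minimal sketch for rebuilding the same keyed table, assuming only the column values and the sorted = "V2" attribute matter:

```r
library(data.table)

# Rebuild the table from the values shown in the dput output,
# leaving out the .internal.selfref pointer.
dt <- data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"),
  key = "V2"  # mirrors the sorted = "V2" attribute
)

key(dt)   # the key column of the rebuilt table
nrow(dt)  # number of rows, duplicates included
```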
For v1.9.8+ (released November 2016)
From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame).
unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G
Or use the by argument to get unique combinations of specific columns (which is what keys were previously used for):
unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G
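The calls above can be put together as a self-contained sketch, assuming data.table v1.9.8 or later. The variable names (all_cols, explicit, via_dup, by_v2) are only illustrative, and the duplicated() variant is an equivalent alternative rather than something the answer itself uses:

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Keep the first occurrence of each (V1, V2) pair; all columns are used by default.
all_cols <- unique(dt)

# The same result with the columns named explicitly.
explicit <- unique(dt, by = c("V1", "V2"))

# Equivalent filter built from the logical vector returned by duplicated().
via_dup <- dt[!duplicated(dt)]

# Deduplicate on V2 only: keeps the first row seen for each V2 value.
by_v2 <- unique(dt, by = "V2")
```

Naming the columns in by makes the intent explicit and keeps the call stable if more columns are added to the table later.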
Prior to v1.9.8
From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.
library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
Calling unique with one column as key:
setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G
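The excerpt stops before showing the all-columns case that the sentence about resetting the key describes. A minimal sketch of that step, assuming that setkey called without column names keys the table on every column (on v1.9.8+ unique already uses all columns, so the key then only affects row order):

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# No columns named: the key becomes all columns (V1, V2),
# and the rows are sorted by that key.
setkey(dt)

# With every column in the key, unique() compares whole rows.
deduped <- unique(dt)

key(dt)        # the key columns after setkey()
nrow(deduped)  # number of distinct (V1, V2) rows
```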