Filtering out duplicated/non-unique rows in data.table
Question
I have a data.table with about 2.5 million rows and two columns. I want to remove any rows that are duplicated in both columns. Previously, for a data.frame, I would have done this:
df <- unique(df[,c('V1', 'V2')])
but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE])
but it still seems to operate only on the key of the data.table and not on the whole row.
Any suggestions?
Cheers,
Davy
Example
> dt
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] A B
[5,] B A
[6,] C D
[7,] C D
[8,] E F
[9,] G G
[10,] A B
In the above data.table, where V2 is the table key, only rows 4, 7, and 10 would be removed.
> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C",
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F",
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")
Solution
From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.
library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
Calling unique with one column as key:
setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G
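The key-only behaviour shown above is what data.table did at the time of this answer. As far as I know, from data.table 1.9.8 onward unique() considers all columns by default, so the same six-row result has to be requested explicitly through the by argument. A minimal sketch, assuming a data.table version that supports by in unique():

```r
library(data.table)

# Same example table as in the answer
dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
setkey(dt, V2)

# Deduplicate on the key column(s) only, stated explicitly via `by`
# rather than relying on the version-dependent default
by_key <- unique(dt, by = key(dt))
nrow(by_key)  # 6: one row per distinct value of V2
```

Writing by = key(dt) makes the intent visible and should give the same answer regardless of which default the installed version uses.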
Reset the key to all columns, then call unique:
setkey(dt)
unique(dt)
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] B A
[5,] C D
[6,] E F
[7,] G G
Edit from Matthew:
Or, instead of setting the key to all columns, which might take some time for large tables with many rows and many columns, removing the key achieves the same result:
setkey(dt,NULL)
unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G
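If changing or dropping the key is undesirable, the columns to compare can also be named directly. The sketch below is not part of the original answer; it assumes a data.table version in which unique() and duplicated() accept a by argument, and the variable names are only illustrative:

```r
library(data.table)

# Same example table as in the answer, left unkeyed
dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

# Keep the first occurrence of each (V1, V2) pair
uniq_rows <- unique(dt, by = c("V1", "V2"))

# Equivalent filter written with duplicated()
uniq_rows2 <- dt[!duplicated(dt, by = c("V1", "V2"))]

nrow(uniq_rows)   # 7
nrow(uniq_rows2)  # 7
```

Both forms leave any existing key on dt untouched, which may matter when later joins depend on it.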