How to remove duplicated values in uneven columns of a data.table?
Question
I want to remove duplicated values in each column of an uneven data.table. For instance, if the original data is (the real data table has many columns and rows):
library(data.table)
dt <- data.table(A = c("5p", "3p", "3p", "6y", NA), B = c("1c", "4r", "1c", NA, NA), C = c("4f", "5", "5", "5", "4m"))
> dt
A B C
1: 5p 1c 4f
2: 3p 4r 5
3: 3p 1c 5
4: 6y <NA> 5
5: <NA> <NA> 4m
After removal of duplicated values in each column, it should look like this:
A B C
5p 1c 4f
3p 4r 5
NA NA NA
6y NA NA
NA NA 4m
I am trying a solution proposed in another thread using data.table. However, only the first duplicated value in each column is replaced with NA, not the subsequent ones.
cols <- colnames(dt)
dt[, lapply(.SD, function(x) replace(x, anyDuplicated(x), NA)), .SDcols = cols]
> dt
A B C
1: 5p 1c 4f
2: 3p 4r 5
3: <NA> <NA> <NA>
4: 6y <NA> 5
5: <NA> <NA> 4m
How should I modify the code to get all duplicates replaced?
Recommended answer
I believe this would be the proper data.table way of achieving this task:
cols <- colnames(dt)
dt[, (cols) := lapply(.SD, function(x) replace(x, duplicated(x), NA))][]  # trailing [] prints the updated table
A B C
1: 5p 1c 4f
2: 3p 4r 5
3: <NA> <NA> <NA>
4: 6y <NA> <NA>
5: <NA> <NA> 4m
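To see why the attempt in the question replaces only one value per column, it may help to compare the two base R functions on a single vector. The sketch below uses column C from the example data; the variable names `x`, `first_dup`, and `dup_flags` are illustrative and not part of the original post.

```r
# Column C from the question's example data
x <- c("4f", "5", "5", "5", "4m")

# anyDuplicated() returns a single integer: the index of the first
# duplicated element (or 0 if there is none)
first_dup <- anyDuplicated(x)

# duplicated() returns a logical vector that flags every element
# that repeats an earlier one
dup_flags <- duplicated(x)

# Replacing at a single index changes only one position
only_first <- replace(x, first_dup, NA)

# Replacing with the logical vector changes every flagged position
all_dups <- replace(x, dup_flags, NA)

print(only_first)
print(all_dups)
```

Because `anyDuplicated()` yields one index, `replace()` can blank only that position, whereas the logical vector from `duplicated()` covers all repeats at once.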
Notes:
- .SD defaults to all columns, so in this case there is no need to specify the .SDcols argument.
- Using := avoids copying the whole data.table.
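If only some columns should be de-duplicated, .SDcols becomes useful again. The following is a minimal sketch under that assumption; the choice of columns A and C is hypothetical and only serves to illustrate the pattern.

```r
library(data.table)

dt <- data.table(
  A = c("5p", "3p", "3p", "6y", NA),
  B = c("1c", "4r", "1c", NA, NA),
  C = c("4f", "5", "5", "5", "4m")
)

# Hypothetical subset: de-duplicate only A and C, leave B untouched
cols <- c("A", "C")

# .SDcols limits .SD to the chosen columns; := updates them by reference
dt[, (cols) := lapply(.SD, function(x) replace(x, duplicated(x), NA)),
   .SDcols = cols]

# := returns its result invisibly, so use dt[] (or print(dt)) to display it
dt[]
```

Since the update happens by reference, `dt` itself is modified and there is no need to assign the result back.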