How to remove duplicated values in uneven columns of a data.table?

Question

I want to remove the duplicated values in each column of an uneven data.table. For instance, suppose the original data is as follows (the real data table has many more columns and rows):

library(data.table)
dt <- data.table(A = c("5p", "3p", "3p", "6y", NA), B = c("1c", "4r", "1c", NA, NA), C = c("4f", "5", "5", "5", "4m"))
> dt
      A    B  C
1:   5p   1c 4f
2:   3p   4r  5
3:   3p   1c  5
4:   6y <NA>  5
5: <NA> <NA> 4m

After removing the duplicated values in each column, it should look like this:

A    B    C
5p   1c   4f
3p   4r   5
NA   NA   NA
6y   NA   NA
NA   NA   4m

I am trying a data.table solution proposed in another thread. However, only the first duplicated value in each column is replaced with NA; the subsequent duplicates are left in place.

cols <- colnames(dt)
dt[, lapply(.SD, function(x) replace(x, anyDuplicated(x), NA)), .SDcols = cols]
> dt
      A    B    C
1:   5p   1c   4f
2:   3p   4r    5
3: <NA> <NA> <NA>
4:   6y <NA>    5
5: <NA> <NA>   4m

How should I modify the code so that all duplicates are replaced?

Recommended answer

I believe this would be the proper data.table way of achieving this task:

cols <- colnames(dt)
dt[, (cols) := lapply(.SD, function(x) replace(x, duplicated(x), NA))]

      A    B    C
1:   5p   1c   4f
2:   3p   4r    5
3: <NA> <NA> <NA>
4:   6y <NA> <NA>
5: <NA> <NA>   4m
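
The reason the original attempt blanks only one row per column is the difference between the two base R functions: anyDuplicated() returns a single integer, the index of the first duplicated element (or 0 if there is none), whereas duplicated() returns a logical vector that flags every repeat. A minimal sketch using column C from the question (the variable names first_only and all_dups are only illustrative):

```r
# Column C from the question
x <- c("4f", "5", "5", "5", "4m")

# anyDuplicated() gives only the position of the first repeat
anyDuplicated(x)

# duplicated() flags every repeat after the first occurrence
duplicated(x)

# replace() with a single index blanks just one element...
first_only <- replace(x, anyDuplicated(x), NA)

# ...while a logical mask blanks all repeats
all_dups <- replace(x, duplicated(x), NA)

print(first_only)
print(all_dups)
```

Note that duplicated() also treats a second NA as a repeat of the first one, which is harmless here because the replacement value is NA anyway.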

Notes:

  • .SD defaults to all columns, so in this case there is no need to specify the .SDcols argument.
  • Using := modifies dt in place (by reference), which avoids copying the whole data.table.
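
Both notes can be illustrated with a short sketch. It assumes, purely for illustration, that only columns A and C should be cleaned, so .SDcols is needed here, and it uses data.table's address() helper to check that the object is still the same one after the update:

```r
library(data.table)

dt <- data.table(A = c("5p", "3p", "3p", "6y", NA),
                 B = c("1c", "4r", "1c", NA, NA),
                 C = c("4f", "5", "5", "5", "4m"))

# Remember where dt lives in memory before the update
addr_before <- address(dt)

# Hypothetical subset: clean only A and C, leave B untouched
cols <- c("A", "C")
dt[, (cols) := lapply(.SD, function(x) replace(x, duplicated(x), NA)),
   .SDcols = cols]

# := updated dt by reference, so no new data.table was created
print(address(dt) == addr_before)
print(dt)
```

Without := the call returns a new data.table and leaves dt unchanged, which is why the result of the original attempt was never stored.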
