在R中的一小组列中查找具有重复值的行 [英] Find rows with duplicate values across a small set of columns in R

查看:97
本文介绍了在R中的一小组列中查找具有重复值的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设有一个data.table在其他四列中有一个id和整数值。如何有效地找到四个其他列中的四个值中至少有两个是相同的行?

Let's say that have a data.table with an id and integer values in four other columns. How can I efficiently find the rows where at least two of the four values in the four other columns are the same?

fooTbl = data.table(id = c('a', 'b'), ind1=c(1,2), ind2=c(3,4), ind3=c(2,3), ind4=c(2,1))
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1

我已经有两个解决方案了。第一个比第二个快得多,但第一个要求硬编码所有的组合,并检查每个组合的相等性。随着列数的增加,这似乎是不可取的和难以维护的:

I have two solutions already. The first is much faster than the second, but the first requires hard-coding all of the combinations and checking equality for each of them. This seems undesirable and difficult to maintain as the number of columns increases:

fooTbl[, uniq := (ind1 != ind2 & ind1 != ind3 & ind1 != ind4 & ind2 != ind3 & ind2 != ind4 & ind3 != ind4)]
fooTbl
#    id ind1 ind2 ind3 ind4  uniq
# 1:  a    1    3    2    2 FALSE
# 2:  b    2    4    3    1  TRUE

第二个是使用data.table并对长表形式的表进行操作。这个更可维护(没有硬编码所有的组合),但是更慢:

The second is to use data.table and operate on a long form of the table. This one is more maintainable (no hard coding of all of the combinations) but is much slower:

fooTbl[, uniq := NULL]
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1
fooTbl = melt(fooTbl, measure=c('ind1', 'ind2', 'ind3', 'ind4'))
fooTbl
#    id variable value
# 1:  a     ind1     1
# 2:  b     ind1     2
# 3:  a     ind2     3
# 4:  b     ind2     4
# 5:  a     ind3     2
# 6:  b     ind3     3
# 7:  a     ind4     2
# 8:  b     ind4     1
fooTbl[, N := length(unique(value)), by=id]
fooTbl[, uniq := N == 4][, N := NULL]
fooTbl
   id variable value  uniq
1:  a     ind1     1 FALSE
2:  b     ind1     2  TRUE
3:  a     ind2     3 FALSE
4:  b     ind2     4  TRUE
5:  a     ind3     2 FALSE
6:  b     ind3     3  TRUE
7:  a     ind4     2 FALSE
8:  b     ind4     1  TRUE
fooTbl = dcast(fooTbl, id + uniq ~ variable, value.var='value')
fooTbl
  id  uniq ind1 ind2 ind3 ind4
1  a FALSE    1    3    2    2
2  b  TRUE    2    4    3    1

第一(宽)解决方案的速度,没有硬编码所有的检查组合?

Is there a way that I can get the speed of the first (wide) solution without hard coding all of the combinations of checks?

N对于我的实际表是可管理的(〜3M),但足够大第二个解决方案中的by操作的权重。

N for my actual table is manageable (~ 3M) but large enough to feel the weight of the by operation in the second solution.

推荐答案

我最终使用@ Arun的建议并以编程方式构建表达式并评估。这里有一个data.table的具体实现。我不得不诉诸于字符串操作(而不是只使用符号使用bquote操作),所以它不是那么干净,但它的工作原理。

I ended up going with @Arun's suggestion and programmatically built up the expression and evaluated it. Here's a data.table specific implementation. I had to resort to string manipulation (instead of operating only on symbols using bquote), so it's not as clean as I would like, but it works.

allColUniqExpr <- function(colNames, resColName) {
    makeExpr = function(x) sprintf('%s != %s', x[1], x[2])
    expr = apply(combn(colNames, 2), 2, makeExpr)
    expr = paste(expr, sep='', collapse=' & ')
    expr = sprintf('%s := %s', resColName, expr)
    expr = parse(text=expr)
    expr
}

要使用:

fooTbl[, eval(allColUniqExpr(c('ind1', 'ind2', 'ind3', 'ind4'), 'uniq'))]

这篇关于在R中的一小组列中查找具有重复值的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆