Testing for multiple identical columns in R

Question

Is there a short way to test whether multiple columns are identical? For example, given this input:

library(data.table)
data <- data.table(one = c(1, 2, 3, 4), two = c(7, 8, 9, 10), three = c(1, 2, 3, 4), four = c(1, 2, 3, 4))

Is there something that would return all the columns that are identical to data$one? Something like

allcolumnsidentity(data$one, data) # compares all columns with respect to data$one 

It should return (TRUE, FALSE, TRUE, TRUE), since data$three and data$four are identical to data$one.

I saw the identical() and compare() commands, but they only compare two columns at a time. Is there a generalized way to do it?

Best regards

Recommended answer

Here are 3 more possible solutions and a benchmark on a somewhat larger data set

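library(data.table)
library(microbenchmark)

# Larger version of the example data: 4 million rows, 4 integer columns
n <- 1e6
data <- data.table(one = rep(1:4, n),
                   two = rep(7:10, n),
                   three = rep(1:4, n),
                   four = rep(1:4, n))
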
microbenchmark(
              apply(data, 2, identical, data$one),
              colSums(data == data$one) == nrow(data),
              colSums(as.matrix(data) == data$one) == nrow(data),
              data[, lapply(.SD, function(x) sum(x == data$one) == .N)],
              data[, lapply(.SD, function(x) identical(x, data$one))]
)


# Unit: milliseconds
#                                                      expr        min          lq        mean      median          uq        max neval
#                       apply(data, 2, identical, data$one)  352.58769  414.846535  457.767582  437.041789  521.895046  643.77981   100
#                   colSums(data == data$one) == nrow(data) 1264.95548 1315.882084 1335.827386 1326.250976 1346.501505 1466.64232   100
#        colSums(as.matrix(data) == data$one) == nrow(data)  110.05474  114.618818  125.116033  121.631323  126.912647  185.69939   100
# data[, lapply(.SD, function(x) sum(x == data$one) == .N)]   75.36791   77.960613   85.599088   79.327108   89.369938  156.03422   100
#   data[, lapply(.SD, function(x) identical(x, data$one))]    7.00261    7.448851    8.687903    8.776724    9.491253   15.72188   100
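
The benchmark shows timings but not what the expressions return. As a minimal sketch, the identical()-based approach can be wrapped into the kind of helper the question asks for. The name `all_columns_identical` is hypothetical (it is not part of data.table or of the original answer), and the sketch uses vapply() instead of `.SD` so that the result is a plain named logical vector rather than a one-row data.table.

```r
library(data.table)

# Hypothetical helper (name not from the original post): compare every column
# of `dt` against the reference vector `ref` and return a named logical vector.
all_columns_identical <- function(ref, dt) {
  # A data.table is a list of columns, so vapply() visits one column at a time
  # and calls identical(column, ref).
  vapply(dt, identical, logical(1), ref)
}

# The small example from the question
data <- data.table(one = c(1, 2, 3, 4), two = c(7, 8, 9, 10),
                   three = c(1, 2, 3, 4), four = c(1, 2, 3, 4))

res <- all_columns_identical(data$one, data)
print(res)  # one TRUE, two FALSE, three TRUE, four TRUE
```

To get the names of the matching columns instead, `names(res)[res]` can be used.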

And here are some comparisons in case you have many columns

n <- 1e7
set.seed(123)
data <- data.table(matrix(sample(n, replace = TRUE), ncol = 400))

microbenchmark(
               apply(data, 2, identical, data$V1),
               colSums(data == data$V1) == nrow(data),
               colSums(as.matrix(data) == data$V1) == nrow(data),
               data[, lapply(.SD, function(x) sum(x == data$V1) == .N)],
               data[, lapply(.SD, function(x) identical(x,data$V1))]
)

# Unit: milliseconds
#                                                     expr       min        lq      mean    median        uq       max neval
#                       apply(data, 2, identical, data$V1) 176.65997 185.23895 235.44088 234.60227 253.88658 331.18788   100
#                   colSums(data == data$V1) == nrow(data) 680.48398 759.82115 786.64634 774.86919 804.91661 987.26456   100
#        colSums(as.matrix(data) == data$V1) == nrow(data)  60.62470  62.86181  70.41601  63.75478  65.16708 120.30393   100
# data[, lapply(.SD, function(x) sum(x == data$V1) == .N)]  83.95790  86.72680  90.45487  88.46165  90.04441 142.08614   100
#   data[, lapply(.SD, function(x) identical(x, data$V1))]  40.86718  42.65486  45.06100  44.29602  45.49430  91.57465   100
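
Note that the benchmarked expressions agree on the data used here, but they are not equivalent in general. The following base R sketch (the toy vectors are made up for illustration) shows three cases where they can differ: storage type, missing values, and the matrix conversion that apply() performs.

```r
ref <- c(1, 2, 3, 4)          # double vector
int_col <- 1:4                # same values, stored as integer

# identical() is strict about storage type; `==` compares values only.
strict_type <- identical(int_col, ref)              # FALSE
loose_type  <- sum(int_col == ref) == length(ref)   # TRUE

# With a missing value, the `==`-based count becomes NA rather than TRUE/FALSE.
na_col <- c(1, NA, 3, 4)
strict_na <- identical(na_col, na_col)                 # TRUE
loose_na  <- sum(na_col == na_col) == length(na_col)   # NA

# apply() first converts the table to a matrix; one character column turns
# every column into character, so even a true duplicate no longer matches.
df <- data.frame(a = ref, b = ref, s = c("w", "x", "y", "z"))
via_apply  <- apply(df, 2, identical, df$a)             # all FALSE
via_vapply <- vapply(df, identical, logical(1), df$a)   # a TRUE, b TRUE, s FALSE
```

Which behavior is wanted depends on the data: identical() is the stricter test, while the `==`-based variants ignore the integer/double distinction but need extra NA handling.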
