R Checking for duplicates is painfully slow, even with mclapply


Question

I've got some data involving repeated sales for a bunch of cars with unique IDs. A car can be sold more than once.

Some of the IDs are erroneous, however, so for each ID I'm checking whether the size is recorded as the same over multiple sales. If it isn't, then I know that the ID is erroneous.

I'm trying to do this with the following code:

library(parallel)  # provides mclapply()

Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply(as.list(v[-1]), FUN = function(z) isTRUE(all.equal(z, v[1]))))

IsGoodId <- function(Id) {
  Sub <- Data[Data$ID == Id, ]
  if (nrow(Sub) > 1) {
    return(compare(Sub[, "Size"]))  # TRUE only if every Size matches the first
  } else {
    return(TRUE)  # a single sale can't conflict with itself
  }
}

WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N

Answer

Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. First find the duplicate sales, on the assumption that these are a small subset of the total.

dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups,,drop=FALSE]

The %in% operator scales very well. Then split the Size column by ID, checking for IDs recorded with more than one size:

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)

This gives a named logical vector, with TRUE indicating that there is more than one size per ID. This scales approximately linearly with the number of duplicate sales. There are clever ways to make this go fast, so if your duplicated data is itself big...

Hmm, thinking about it a bit more, I guess

u = unique(Data)
u$ID[duplicated(u$ID)]

does the trick.
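On the same sample data, dropping identical (ID, Size) rows first means any ID still duplicated afterwards must carry conflicting sizes:

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

u <- unique(Data)        # collapse rows that agree on both ID and Size
u$ID[duplicated(u$ID)]   # 34623 -- the only ID left with two distinct sizes
```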

