R Checking for duplicates is painfully slow, even with mclapply


Question

I've got some data involving repeated sales for a bunch of cars with unique IDs. A car can be sold more than once.

Some of the IDs are erroneous, however, so for each ID I'm checking whether the size is recorded as the same over multiple sales. If it isn't, then I know that the ID is erroneous.

I'm trying to do this with the following code:

library(parallel)  # provides mclapply()

Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply(as.list(v[-1]), FUN = function(z) isTRUE(all.equal(z, v[1]))))

IsGoodId <- function(Id) {
  Sub <- Data[Data$ID == Id, ]
  if (nrow(Sub) > 1) {
    return(compare(Sub[, "Size"]))  # TRUE only if every Size matches the first
  } else {
    return(TRUE)  # a single sale can't conflict with itself
  }
}

WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N

Answer

Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. First find the duplicate sales, on the assumption that these are a small subset of the total.

dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups,,drop=FALSE]

The %in% operator scales very well. Then split the Size column by ID, checking for IDs recorded with more than one size:

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)

This gives a named logical vector, with TRUE indicating that there is more than one size per ID. This scales approximately linearly with the number of duplicate sales. There are clever ways to make this go fast, so if your duplicated data is itself big...

Hmm, thinking about it a bit more, I guess

u = unique(Data)
u$ID[duplicated(u$ID)]

does the trick.
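On the same sample data, dropping identical (ID, Size) rows first means any ID still duplicated afterwards must carry conflicting sizes:

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

u <- unique(Data)        # collapse rows that agree on both ID and Size
u$ID[duplicated(u$ID)]   # 34623 -- the only ID left with two distinct sizes
```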

