How to vectorize or otherwise speed-up this looping logic in R?

Long time lurker, first time asker.

I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ items dataset. Sample data looks like this.

#serially numbered items
parents <- rep(1:10000)

#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))

#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
  if (numchild[x]>0){
    f1 <- sample(1:length(parents), size=numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  }
  else {
    parent_child <- c(parent_child, list(x+1))    #if numchild=0, make up something
  }
}

Here is what I want to do: say parent item #1 has 5 child items -- 1,2,3,4,5 -- and parent item #2 has 3 child items -- 4,10,22.

I want to compute the length(intersection) of every (parent_i, parent_j) combination. In the above case, it would be 1 common item-- 4.
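
For that single pair the count can be checked directly with base intersect (p1 and p2 here are just the example children listed above):

p1 <- c(1, 2, 3, 4, 5)     # children of parent #1
p2 <- c(4, 10, 22)         # children of parent #2
length(intersect(p1, p2))  # 1 -- the shared child is item 4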

I am doing this for 10M+ parent items that on average have 15-20 child items each, with a (0,100) range. So that's a 10M x 10M item-item matrix.

I have a foreach loop that works on a smaller test subset but doesn't quite scale to the full dataset (64-core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.

#small subset
a <- parent_child[1:1000]

outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]

  foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
  }  
}

I've been experimenting with variations on this (using Reduce, storing parent-children in a dataframe, etc.) but haven't had much luck.

Is there a way to make this scale?

Solution

I reversed the split, so that we have a child-parent relationship

len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len), 
                      unlist(parent_child, use.names=FALSE))
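
On a small made-up list the reversal looks like this (the names of child_parent are child ids, the values are the parents containing that child; toy is purely illustrative):

toy <- list(c(1, 2), c(2, 3), c(1, 3))   # three hypothetical parents and their children
len <- sapply(toy, length)
split(rep(seq_along(toy), len), unlist(toy, use.names=FALSE))
## $`1`
## [1] 1 3
## $`2`
## [1] 1 2
## $`3`
## [1] 2 3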

Something like the following constructs a string with pairs of parents sharing a child, across all children

keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
})
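
For instance, for a single child shared by parents 2, 7 and 9, that step produces one string per parent pair (illustrative values only):

x <- combn(sort(c(7, 2, 9)), 2)   # columns are the pairs (2,7), (2,9), (7,9)
paste(x[1,], x[2,], sep=".")      # "2.7" "2.9" "7.9"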

and tallying

table(unlist(int, use.names=FALSE))

or a little more quickly

xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)

Packaged as a function:

f1 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

Timing this for all 10000 parent-child elements:

> system.time(ans1 <- f1(parent_child))
   user  system elapsed 
 14.625   0.012  14.668 
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806 
       1        1        1        1        1        1 

I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.

One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.

combn1 <- local({
    memo <- new.env(parent=emptyenv())
    function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
    }
})
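
A quick illustration of the memoization (assuming the combn1 defined above): two calls with vectors of the same length share the cached index matrix, so combn runs only once per distinct length.

combn1(c(2, 7, 9))     # "2.7" "2.9" "5.9"-style output: "2.7" "2.9" "7.9"; computes and caches combn(3, 2)
combn1(c(4, 5, 11))    # "4.5" "4.11" "5.11"; reuses the cached length-3 entry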

f2 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn1)

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

which helps somewhat

>     system.time(ans2 <- f2(parent_child))
   user  system elapsed 
  5.337   0.000   5.347 
>     identical(ans1, ans2)
[1] TRUE

The slow part is now paste

>     Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
"paste"               3.92    73.41       3.92     73.41
"match"               0.74    13.86       0.74     13.86
"unique.default"      0.40     7.49       0.40      7.49
"as.character"        0.08     1.50       0.08      1.50
"unlist"              0.08     1.50       0.08      1.50
"combn"               0.06     1.12       0.06      1.12
"lapply"              0.02     0.37       4.00     74.91
"any"                 0.02     0.37       0.02      0.37
"setNames"            0.02     0.37       0.02      0.37

$by.total
...

We can avoid this by encoding each pair of parents sharing a child into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21

encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
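
As a quick round-trip check with the problem size used here (n = 10000 parents), the encoding is reversible:

n <- 10000
z <- encode(3, 7, n)   # (3 - 1) * 10001 + 7 = 20009
decode(z, n)           # list(x = 3, y = 7)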

and adjusting our combn1 and f2 functions as

combn2 <- local({
    memo <- new.env(parent=emptyenv())
    function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
    }
})

f3 <- function(parent_child) {
    encode_n <- length(parent_child)
    len <- sapply(parent_child, length)
    child_parent <-
        unname(split(rep(seq_along(parent_child), len), 
                     unlist(parent_child, use.names=FALSE)))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn2, encode_n)

    id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
    do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}

leading to

> system.time(f3(parent_child))
   user  system elapsed 
  2.140   0.000   2.146 

This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer

> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
  2.465   0.000   2.468
> system.time(f3(parent_child[1:99]))
   user  system elapsed 
  0.016   0.000   0.014 

and scales in a much more reasonable way.

For what it's worth, the data generation routine falls into the second circle of Patrick Burns's R Inferno: it uses the 'copy-and-append' pattern rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand

numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates positive integer values. Data generation is then

n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
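
A hypothetical rpois-based variant of the same step avoids the negative-value fix-up entirely (zero counts are vanishingly rare at lambda = 30, and the sample call is unchanged):

n_parents <- 10000
numchild <- rpois(n_parents, lambda = 30)                 # non-negative integer counts, mean 30
parent_child <- lapply(numchild, sample, x = n_parents)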
