R: Vectorize loop to create pairwise matrix

Problem description

I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.

Here is an example df:

  df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                   Fruit = c("apple", "orange", "pear",
                             "orange", "pear",
                             "pear", "apple",
                             "pear", "apple", "orange",
                             "pear", "apple", "orange"),
                   Order = c(1, 2, 3,
                            1, 2,
                            1, 2, 
                            1, 2, 3,
                            1, 1, 1))

In each Shop, Fruit is picked by a customer in a given Order.

The following function creates an m x n pairwise matrix:

loop.function <- function(df){
  
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim=c(nt,nt))
  
  for(m in 1:nt){
    
    for(n in 1:nt){
      
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  
  return(mat)
}

Here mat[m,n] is the number of times fruits[m] is picked after fruits[n], and mat[n,m] is the number of times fruits[m] is picked before fruits[n]. Ties are not counted: if a pair of fruits is picked at the same time (e.g. in Shop E), nothing is recorded for that pair.

See expected output:

> loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

You can see here that pear is chosen twice before apple (in Shops C and D), and apple is chosen once before pear (in Shop A).
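As a cross-check of how to read the matrix, the two apple/pear entries can be recomputed directly. This is only a minimal sketch, separate from loop.function(); the names apple, pear, both, pear_before_apple and apple_before_pear are introduced here for illustration, and it assumes each fruit appears at most once per shop.

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple", "orange", "pear", "orange", "pear", "pear", "apple",
            "pear", "apple", "orange", "pear", "apple", "orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Keep one row per shop for each of the two fruits
apple <- df[df$Fruit == "apple", ]
pear  <- df[df$Fruit == "pear", ]

# Join on Shop so only shops where both fruits were picked remain
both <- merge(apple, pear, by = "Shop", suffixes = c(".apple", ".pear"))

# Strict comparisons, so the tie in Shop E is not counted either way
pear_before_apple <- sum(both$Order.pear < both$Order.apple)   # mat["apple", "pear"]
apple_before_pear <- sum(both$Order.apple < both$Order.pear)   # mat["pear", "apple"]

pear_before_apple  # 2 (Shops C and D)
apple_before_pear  # 1 (Shop A)
```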

I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.

(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)

Update

Here is a benchmark on the real data (times = 10000) for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():

Unit: milliseconds
                    expr        min        lq       mean    median         uq      max neval   cld
      loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000     e
 tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456 10000   c
     loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000    d
 datatable.function(dat)   8.517600   9.28085  10.644163   9.97835  10.766749 215.3245 10000  b
  loop.function.TMS(dat)   4.482001   5.08030   5.916408   5.38215   5.833699  77.1935 10000 a

Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try to add Rcpp solutions at a later date; I'm having trouble making them work on the real data.

I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!

Solution

It does not seem possible to vectorize over the original data frame df directly. But if you transform it using reshape2::dcast(), so that there is one row per shop:

require(reshape2)

df$Fruit <- as.character(df$Fruit)

by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")

#   Shop apple orange pear
# 1    A     1      2    3
# 2    B    NA      1    2
# 3    C     2     NA    1
# 4    D     2      3    1
# 5    E     1      1    1

..., then you can easily vectorize at least for each combination of [m, n]:

fruits <- unique(df$Fruit)
outer(fruits, fruits, 
    Vectorize(
        function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), 
        c("m", "n")
    ), 
    by_shop)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0
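Why the Vectorize() wrapper is needed: outer() does not call its function once per pair. It calls it a single time with the fully expanded vectors of m and n and expects one result per pair back. The sketch below illustrates this with a hard-coded copy of by_shop; the helper name count_after and the variables plain and wrapped are hypothetical and used only for illustration.

```r
# Hard-coded copy of the reshaped table, so the example runs without reshape2
by_shop <- data.frame(
  Shop   = c("A", "B", "C", "D", "E"),
  apple  = c(1, NA, 2, 2, 1),
  orange = c(2, 1, NA, 3, 1),
  pear   = c(3, 2, 1, 1, 1)
)
fruits <- c("apple", "orange", "pear")

# Works for one pair (m, n) at a time and returns a single number
count_after <- function(m, n) sum(by_shop[, m] > by_shop[, n], na.rm = TRUE)

# Passed to outer() as is, the function receives all 9 pairs at once and
# returns one number instead of 9, so outer() cannot shape the result
plain <- tryCatch(outer(fruits, fruits, count_after),
                  error = function(e) "error")

# Vectorize() wraps the function with mapply(), which calls it once per pair
wrapped <- outer(fruits, fruits, Vectorize(count_after))
```

Because mapply() still runs an R-level call for every pair, this is a convenience wrapper rather than a speed-up of the per-pair work.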

The outer() call above is probably the solution you had in mind. A much faster solution would be a true vectorization over all combinations of fruits [m, n], but I have been thinking about it and I do not see a way to do it. So I had to use the Vectorize function, which is a wrapper around mapply() and is of course much slower than true vectorization.
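One possible way to avoid Vectorize() altogether, offered as a sketch and not as part of the original answer: build a shops x fruits matrix of pick orders, expand its columns to all (m, n) pairs by index, and do a single matrix comparison followed by colSums(). It uses base R's tapply() instead of dcast() so that it runs without extra packages, and it assumes each fruit is picked at most once per shop. The names M, m_idx, n_idx and after are introduced here for illustration.

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple", "orange", "pear", "orange", "pear", "pear", "apple",
            "pear", "apple", "orange", "pear", "apple", "orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1),
  stringsAsFactors = FALSE
)

fruits <- unique(as.character(df$Fruit))
k <- length(fruits)

# Shops x fruits matrix of pick orders; NA where a fruit was not picked
M <- tapply(df$Order, list(df$Shop, df$Fruit), min)[, fruits, drop = FALSE]

# Column indices for every (m, n) pair, in the column-major order matrix() uses
m_idx <- rep(seq_len(k), times = k)
n_idx <- rep(seq_len(k), each  = k)

# One comparison for all pairs: TRUE where fruit m was picked after fruit n
after <- M[, m_idx, drop = FALSE] > M[, n_idx, drop = FALSE]

# Count per pair over shops; NA (fruit missing in a shop) is ignored
mat <- matrix(colSums(after, na.rm = TRUE), nrow = k,
              dimnames = list(fruits, fruits))
mat
```

The intermediate matrix has one column per pair, so memory grows with the number of shops times the square of the number of fruits; whether this beats the Vectorize() version on real data would need to be benchmarked.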

Benchmark comparison with your original function:

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
     loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923   100
 loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813   100

Function and benchmark code (this version also preserves the dimnames):

require(reshape2)   
loop.function.TMS <- function(df) { 
    df$Fruit <- as.character(df$Fruit)
    by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
    fruits <- unique(df$Fruit)
    o <- outer(fruits, fruits, Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), c("m", "n")), by_shop)
    colnames(o) <- rownames(o) <- fruits
    o
}

require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))
