Difference between c(... %*% ...) and sum(... * ...)

Question

What is the difference between using c(... %*% ...) and sum(... * ...) in a group_by() function from dplyr?

Both of these snippets give the same result:

#1

library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
    group_by(Date, Market) %>% 
    group_by(Revenue = c(Quantity %*% Price), 
             TotalCost = c(Quantity %*% Cost),
             Product, .add = TRUE) %>% 
    summarise(Sold = sum(Quantity)) %>% 
    pivot_wider(names_from = Product, values_from = Sold)

#2

library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
    group_by(Date, Market) %>% 
    group_by(Revenue = sum(Quantity * Price), 
             TotalCost = sum(Quantity * Cost),
             Product, .add = TRUE) %>% 
    summarise(Sold = sum(Quantity)) %>% 
    pivot_wider(names_from = Product, values_from = Sold)

# A tibble: 2 x 7
# Groups:   Date, Market, Revenue, TotalCost [2]
#  Date      Market Revenue TotalCost Apple Banana Orange
#  <chr>     <chr>    <dbl>     <dbl> <int>  <int>  <int>
#1 6/24/2020 A          135      37.5    35     20     20
#2 6/25/2020 A           25      15      10     15     NA

Is there a reason to prefer one of c(... %*% ...) or sum(... * ...) over the other?

Data from the original answer:

df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020", 
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A", 
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF", 
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple", 
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L, 
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
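
As a quick sanity check outside dplyr, the per-group equivalence can be sketched in base R. This is only an illustration: `sales`, `by_date`, `rev_matmul` and `rev_sum` are names introduced here, and the data frame repeats just the `Date`, `Quantity` and `Price` values from `df1`.

```r
# Same values as df1, keeping only the columns needed for Revenue
sales <- data.frame(
  Date     = c("6/24/2020", "6/24/2020", "6/24/2020",
               "6/24/2020", "6/25/2020", "6/25/2020"),
  Quantity = c(20L, 15L, 20L, 20L, 10L, 15L),
  Price    = c(1L, 1L, 2L, 3L, 1L, 1L)
)

# One data frame per date
by_date <- split(sales, sales$Date)

# Revenue per date, computed both ways
rev_matmul <- sapply(by_date, function(g) c(g$Quantity %*% g$Price))
rev_sum    <- sapply(by_date, function(g) sum(g$Quantity * g$Price))

rev_matmul                 # 135 for 6/24/2020, 25 for 6/25/2020
all(rev_matmul == rev_sum) # TRUE
```

Both columns agree with the Revenue values in the output shown above.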

Recommended answer

I'll compile the comments into an answer; others can jump in if I miss anything.

  • %*% and * are drastically different operators: * does element-wise multiplication, and %*% does linear-algebra matrix multiplication. The difference is demonstrated with:

1:4 * 2:5
# [1]  2  6 12 20

1:4 %*% 2:5
#      [,1]
# [1,]   40

sum(1:4 * 2:5)
# [1] 40

If you are looking for a single summary statistic from multiplying two vectors, and the matrix multiplication from linear algebra makes sense, then %*% is the right tool for you.
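
The question wraps `%*%` in `c()`. A minimal sketch of why that is likely there: `%*%` returns a 1 x 1 matrix, and `c()` strips the dimensions so the value behaves like an ordinary length-one vector, which is convenient for a grouping column. The names `x`, `y`, `mat_result`, `vec_result` and `sum_result` are illustrative, not from the original.

```r
x <- 1:4
y <- 2:5

mat_result <- x %*% y      # 1 x 1 matrix
vec_result <- c(x %*% y)   # plain length-1 numeric vector
sum_result <- sum(x * y)   # plain length-1 vector, no dimensions either

dim(mat_result)            # 1 1
dim(vec_result)            # NULL
vec_result == sum_result   # TRUE
```

`drop(x %*% y)` would remove the dimensions as well; `c()` is simply the shorter spelling used in the question.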

There is something to be said for declarative code: while you can do the third operation (sum(.*.)), to me it may be better to use %*%, for two reasons:

  1. Declarative intent. I am saying that I have two matrices on which I intend to do "linear algebra".

  2. Safeguards. If there is any dimensional mismatch, I want to know right away. For example, sum(1:4 * 2:3) still runs, but 1:4 %*% 2:3 fails with an error. With sum(.*.), the mismatch is silently ignored (one reason I think recycling can be a big problem).
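
The safeguard point can be sketched as follows. The names `x`, `y_bad`, `recycled` and `mm_error` are illustrative; the exact wording of the error message may differ between R versions and locales, so it is captured rather than quoted.

```r
x     <- 1:4
y_bad <- 2:3   # wrong length: 2 elements instead of 4

# Recycling makes this "work": y_bad is reused as 2, 3, 2, 3,
# and no warning is raised because 4 is a multiple of 2
recycled <- sum(x * y_bad)

# %*% refuses non-conformable arguments; capture the error message
mm_error <- tryCatch(
  {
    x %*% y_bad
    NA_character_   # only reached if no error occurred
  },
  error = function(e) conditionMessage(e)
)

recycled   # 26
mm_error   # the captured error message
```

Note that when the longer length is not a multiple of the shorter one (for example `1:4 * 2:4`), R does at least emit a warning, but it still returns a result.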

The reason is not performance: with smaller vectors/matrices, the performance of %*% is on par with sum(.*.), but as the data gets larger, %*% becomes relatively more expensive. In the benchmark below, it is a bit over twice as slow by median at 100,000 elements.

m1 <- 1:100 ; m2 <- m1+1 ; m3 <- 1:100000; m4 <- m3+1
microbenchmark::microbenchmark(sm1 = sum(m1*m2), sm2 = m1%*%m2, lg1 = sum(m3*m4), lg2 = m3%*%m4)
# Unit: nanoseconds
#  expr    min     lq   mean median     uq      max neval
#   sm1    800   1100 112900   1600   2100 11083600   100
#   sm2   1100   1400   2143   1900   2450    10200   100
#   lg1 239700 249550 411235 270800 355300 11102800   100
#   lg2 547900 575550 634763 637850 678250   780500   100

  • All of the discussion so far has been about vectors, which are effectively 1-d matrices (as far as %*% seems to think, though even that is not fully accurate). Once you get into true matrices, the two become harder to interchange. In fact, I don't know of an easier way to emulate %*% (short of for loops, etc.):

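    # Assumed definitions (not shown in the source); they reproduce the output below
    # and replace the m1/m2 vectors from the benchmark above
    m1 <- matrix(1:6, nrow = 2)   # 2 x 3
    m2 <- matrix(1:12, nrow = 3)  # 3 x 4
    m1 %*% m2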
    #      [,1] [,2] [,3] [,4]
    # [1,]   22   49   76  103
    # [2,]   28   64  100  136
    t(sapply(seq_len(nrow(m1)), function(i) sapply(seq_len(ncol(m2)), function(j) sum(m1[i,] * m2[,j]))))
    #      [,1] [,2] [,3] [,4]
    # [1,]   22   49   76  103
    # [2,]   28   64  100  136
    

    (And while that nested sapply may not be the fastest non-%*% way to do the matrix-y stuff, %*% is 1-2 orders of magnitude faster, since it is implemented internally in compiled code and meant for "Math!" like this.)
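
For comparison, the "for loops" route mentioned above could look like the sketch below. `matmul_loop` is a hypothetical helper written for this illustration, not something from base R or from the original answer, and `a` and `b` are example matrices chosen here.

```r
# Hypothetical helper: matrix product built only from sum() and `*`
matmul_loop <- function(a, b) {
  stopifnot(ncol(a) == nrow(b))            # conformability check, done by hand
  out <- matrix(0, nrow = nrow(a), ncol = ncol(b))
  for (i in seq_len(nrow(a))) {
    for (j in seq_len(ncol(b))) {
      out[i, j] <- sum(a[i, ] * b[, j])    # row i of a times column j of b
    }
  }
  out
}

a <- matrix(1:6, nrow = 2)    # 2 x 3
b <- matrix(1:12, nrow = 3)   # 3 x 4

all.equal(matmul_loop(a, b), a %*% b)   # TRUE
```

The explicit `stopifnot()` is there because `sum(. * .)` alone would recycle mismatched lengths instead of failing.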

    Bottom line: while %*% does use multiplication internally (as one of a couple of steps), the two are otherwise different operators. Heck, one might also compare * and ^ in the same vein, with a similar outcome.

    Cheers!
