dplyr rowwise sum and other functions like max
Question
If I wanted to sum over some variables in a data frame using dplyr, I could do:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> select(iris, starts_with('Petal')) %>% rowSums()
[1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4
[37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3
[73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1
[109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2
[145] 8.2 7.5 6.9 7.2 7.7 6.9
That's fine, but I would have thought rowwise would accomplish the same thing. It doesn't:
> select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
[1] 743.6
What I particularly want to do is select a set of columns and create a new variable, each value of which is the maximum of the selected columns in that row. For example, if I selected the "Petal" columns, the maximum values would be 1.4, 1.4, 1.3 and so on.
I could do this:
> select(iris, starts_with('Petal')) %>% apply(1, max)
And that's fine. But I'm just curious as to why the rowwise approach doesn't work. I realize I am using rowwise incorrectly; I'm just not sure why it is wrong.
Recommended answer
In short: you are expecting the sum function to be aware of dplyr data structures such as a data frame grouped by row. sum is not aware of them, so it just takes the sum of the whole data.frame.
Here is a brief explanation. This:
select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
can be rewritten without the pipe operator as follows:
data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)
As you can see, you were constructing something called a tibble: the rowwise call returns one, and it adds information to the object specifying that it should be grouped row-wise.
However, only functions that are aware of this grouping, such as summarize and mutate, work as intended. Base R functions like sum are not aware of these objects and treat them like any standard data.frame. And the standard behaviour of sum() on a data frame is to sum every value in it.
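To see this for yourself, a minimal sketch along these lines may help. The variable names petals, by_row and total are only illustrative, and the exact class vector printed depends on the installed dplyr version:

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))
by_row <- rowwise(petals)

# The row-wise grouping is stored as extra class information on the object
class(petals)
class(by_row)

# Base sum() ignores that grouping and adds up every cell of the data frame
total <- sum(by_row)
total
isTRUE(all.equal(total, sum(petals$Petal.Length) + sum(petals$Petal.Width)))
```

The total matches the 743.6 shown in the question, which is simply the two Petal columns summed together.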
Using mutate works:
select(iris, starts_with('Petal')) %>%
rowwise() %>%
mutate(sum = sum(Petal.Width, Petal.Length))
Result:
Source: local data frame [150 x 3]
Groups: <by row>
# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
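The same pattern should carry over to the row-wise maximum asked about in the question. The sketch below is not part of the original answer: it assumes dplyr 1.0.0 or later (for c_across()), and the names petal_max and petals are illustrative. A vectorised base R alternative using pmax() is included for comparison.

```r
library(dplyr)

# Row-wise maximum via rowwise() + mutate(); assumes dplyr >= 1.0.0 for c_across()
petal_max <- iris %>%
  select(starts_with("Petal")) %>%
  rowwise() %>%
  mutate(max = max(c_across(starts_with("Petal")))) %>%
  ungroup()  # drop the row-wise grouping once it is no longer needed

# Vectorised alternative that avoids rowwise() entirely:
# pmax() takes the element-wise maximum across the columns passed to it
petals <- select(iris, starts_with("Petal"))
petals$max <- do.call(pmax, petals)

head(petal_max$max, 3)
head(petals$max, 3)
```

Both approaches should give the 1.4, 1.4, 1.3 values mentioned in the question for the first three rows. The pmax() version is usually faster on large data because it does not evaluate the expression once per row.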