dplyr rowwise sum and other functions like max
Question
If I wanted to sum over some variables in a data frame using dplyr, I could do:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> select(iris, starts_with('Petal')) %>% rowSums()
[1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4
[37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3
[73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1
[109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2
[145] 8.2 7.5 6.9 7.2 7.7 6.9
That's fine, but I would have thought rowwise accomplishes the same thing, and it doesn't:
> select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
[1] 743.6
What I particularly want to do is select a set of columns and create a new variable, each value of which is the maximum of the selected columns in that row. For example, if I selected the "Petal" columns, the maximum values would be 1.4, 1.4, 1.3 and so on.
I can do this with:
> select(iris, starts_with('Petal')) %>% apply(1, max)
And that's fine. But I'm just curious as to why the rowwise approach doesn't work. I realize I am using rowwise incorrectly; I'm just not sure why it is wrong.
Recommended answer
In short: you are expecting the sum function to be aware of dplyr data structures such as a data frame grouped by row. sum is not aware of them, so it just takes the sum of the whole data.frame.
Here is a brief explanation. This:
select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
can be rewritten without the pipe operator as follows:
data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)
As you can see, you were constructing something called a tibble. The rowwise call then adds additional information to this object, specifying that it should be grouped row-wise.
However, only functions that are aware of this grouping, such as summarize and mutate, work as intended. Base R functions like sum are not aware of these objects and treat them like any standard data.frame. And the standard behavior of sum() is to sum the entire data frame.
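To make this concrete, here is a minimal sketch that compares sum() on the plain selection and on the rowwise version. The variable names (petals, by_row, total_plain, total_by_row) are made up for illustration, and the exact class vector printed may differ between dplyr versions.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))
by_row <- rowwise(petals)

# rowwise() adds a class and grouping metadata; the cell values are unchanged
class(by_row)

# Base sum() ignores that metadata and adds up every cell in both cases
total_plain  <- sum(petals)
total_by_row <- sum(by_row)
all.equal(total_plain, total_by_row)
```

Both totals come out the same, which is why the rowwise() step appears to have no effect in the original pipeline.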
Using mutate works:
select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))
Result:
Source: local data frame [150 x 3]
Groups: <by row>
# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
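The same pattern should carry over to the row-wise maximum asked about in the question. The sketch below shows three possible ways to do it; the column name petal_max is hypothetical, c_across() assumes dplyr 1.0.0 or later, and pmax() is a base R alternative that avoids row-wise grouping altogether.

```r
library(dplyr)

# Option 1: rowwise() + mutate(), naming the columns explicitly
res_rowwise <- iris %>%
  rowwise() %>%
  mutate(petal_max = max(Petal.Length, Petal.Width)) %>%
  ungroup()

# Option 2: rowwise() + c_across() with a selection helper
# (assumes dplyr >= 1.0.0, where c_across() was introduced)
res_c_across <- iris %>%
  rowwise() %>%
  mutate(petal_max = max(c_across(starts_with("Petal")))) %>%
  ungroup()

# Option 3: vectorised pmax() from base R, no row-wise grouping needed
res_pmax <- iris %>%
  mutate(petal_max = pmax(Petal.Length, Petal.Width))

head(res_pmax$petal_max, 3)
```

For a fixed pair of columns, pmax() is usually the simplest choice because it is already vectorised; the rowwise() variants are more useful when the function being applied has no vectorised counterpart.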