dplyr rowwise sum and other functions like max

Question

If I wanted to sum over some variables in a data-frame using dplyr, I could do:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> select(iris, starts_with('Petal')) %>% rowSums()
  [1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4
 [37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3
 [73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1
[109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2
[145] 8.2 7.5 6.9 7.2 7.7 6.9

That's fine, but I would have thought rowwise accomplishes the same thing, and it doesn't:

> select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
[1] 743.6

What I particularly want to do is select a set of columns and create a new variable, each value of which is the maximum of the selected columns in that row. For example, if I selected the "Petal" columns, the maximum values would be 1.4, 1.4, 1.3 and so on.

I can do this:

> select(iris, starts_with('Petal')) %>% apply(1, max)

And that's fine. But I'm curious as to why the rowwise approach doesn't work. I realize I am using rowwise incorrectly; I'm just not sure why it is wrong.

Recommended answer

In short: you are expecting the sum function to be aware of dplyr data structures such as a data frame grouped by row. sum is not aware of them, so it just takes the sum of the whole data.frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see, you were constructing something called a tibble. The rowwise call then adds information to this object, specifying that it should be grouped row-wise.
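To see what the rowwise call adds, one can inspect the class of the object before and after. This is a minimal sketch, assuming dplyr is installed; the names `petals` and `by_row` are made up for illustration, and the exact class vector printed may vary between dplyr versions.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))
by_row <- rowwise(petals)

# The values are untouched; rowwise() only changes the class and
# attaches row-wise grouping metadata.
class(petals)
class(by_row)

# The row-wise object is still a data frame as far as base R is concerned.
inherits(by_row, "rowwise_df")
inherits(by_row, "data.frame")
```

Because the object still inherits from data.frame, base R functions fall back to their ordinary data frame behaviour.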

However, only functions that are aware of this grouping, such as summarize and mutate, work as intended. Base R functions like sum are not aware of these objects and treat them like any standard data.frame, and the standard behaviour of sum() on a data frame is to sum every value in it.
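The point can be checked directly: sum() returns the same single total whether or not rowwise() has been applied, while rowSums() returns one value per row. A small sketch, assuming dplyr is installed; the variable names are illustrative only.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))

total_plain  <- sum(petals)           # one number for the whole data frame
total_by_row <- sum(rowwise(petals))  # rowwise() makes no difference to sum()
per_row      <- rowSums(petals)       # one number per row

all.equal(total_plain, total_by_row)
length(per_row)
```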

Using mutate works:

select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
   Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1         1.40       0.200  1.60
 2         1.40       0.200  1.60
 3         1.30       0.200  1.50
 ...
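The same pattern can be applied to the row-wise maximum the question asks for. The sketch below is not part of the original answer: it assumes dplyr 1.0.0 or later (for c_across()), and the column name `row_max` is a hypothetical choice. It also shows pmax() from base R as a vectorised alternative that does not need rowwise() at all.

```r
library(dplyr)

petals <- select(iris, starts_with("Petal"))

# Option 1: stay row-wise and let c_across() collect the selected columns
# of the current row into one vector for max().
with_max <- petals %>%
  rowwise() %>%
  mutate(row_max = max(c_across(starts_with("Petal")))) %>%
  ungroup()

# Option 2: skip rowwise() and use the vectorised base R pmax().
with_pmax <- mutate(petals, row_max = pmax(Petal.Length, Petal.Width))

head(with_max$row_max, 3)
```

Both options should agree with the apply(1, max) approach from the question. On larger data the pmax() version is likely to be faster, since it avoids evaluating the expression once per row.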
