dplyr idiom for "select A, B, max(C) from D group by C"


Question

I am looking for a dplyr idiom for SQL group by queries with several result columns. For example:

library(dplyr)
library(sqldf)

df <- data.frame(
  fuel=rep(c("Coal", "Gas"), each=3), 
  year=rep(c(1998,1999,2000), 2),
  percent=c(20,30,40,80,70,60)) 

sqldf("select fuel, year, max(percent) from df group by fuel")

 fuel year max(percent)
 1 Coal 2000           40
 2  Gas 1998           80

The sqldf query returns the year in which each fuel reached its maximum percentage (ignoring ties). What is the best way to do this using dplyr? Simply doing:

group_by(df,fuel) %>% summarise(max(percent))

gives:

  fuel max(percent)
1 Coal           40
2  Gas           80

and there does not seem to be a place to add an extra result column. I can do it indirectly by using mutate:

group_by(df,fuel) %>% mutate(maxp=max(percent)) %>% 
   filter(percent==maxp) %>% select(-percent)

Is that the best/only way?

Recommended answer

Some more options

Using distinct (this is similar to slice(which.max(percent)), but it avoids a by-group operation and is therefore probably more efficient). In dplyr 0.5.0 and later, distinct() keeps only the named columns unless .keep_all = TRUE is set:

df %>% 
  arrange(desc(percent)) %>%
  distinct(fuel, .keep_all = TRUE)

#   fuel year percent
# 1  Gas 1998      80
# 2 Coal 2000      40 
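
The slice(which.max(percent)) variant mentioned above is not shown in the answer. A minimal sketch of it, assuming the same df as in the question (the name res_slice is only illustrative):

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  fuel = rep(c("Coal", "Gas"), each = 3),
  year = rep(c(1998, 1999, 2000), 2),
  percent = c(20, 30, 40, 80, 70, 60))

# Within each fuel, keep the first row at which percent is largest.
# which.max() returns a single index, so ties are ignored.
res_slice <- df %>%
  group_by(fuel) %>%
  slice(which.max(percent)) %>%
  ungroup()

print(res_slice)
```

Unlike the distinct approach, this does not depend on the rows being sorted first, at the cost of a grouped operation.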

Or using filter (this selects every row that attains the group maximum, so ties are kept)

df %>% 
  group_by(fuel) %>% 
  filter(percent == max(percent))
# Source: local data frame [2 x 3]
# Groups: fuel [2]
# 
#     fuel  year percent
#   (fctr) (dbl)   (dbl)
# 1   Coal  2000      40
# 2    Gas  1998      80
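
The data in the question has no ties, so the difference between filter and a single-row selection does not show up in the output above. A small sketch with a made-up data frame (df_ties is hypothetical and not part of the original post) illustrates it:

```r
library(dplyr)

# Hypothetical data: Coal reaches its maximum of 40 in two different years
df_ties <- data.frame(
  fuel = c("Coal", "Coal", "Gas"),
  year = c(1999, 2000, 1998),
  percent = c(40, 40, 80))

# filter keeps every row equal to the group maximum (both Coal rows)
res_filter <- df_ties %>%
  group_by(fuel) %>%
  filter(percent == max(percent)) %>%
  ungroup()

# slice(which.max()) keeps only the first such row per group
res_first <- df_ties %>%
  group_by(fuel) %>%
  slice(which.max(percent)) %>%
  ungroup()

print(nrow(res_filter))
print(nrow(res_first))
```

Which behaviour is wanted depends on whether tied years should all be reported or only one of them, as in the sqldf query.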

Or using top_n (similar result to filter(percent == max(percent)))

df %>% 
  group_by(fuel) %>% 
  top_n(n = 1, percent) # If percent is always the last column, you can just do top_n(n = 1)

# Source: local data frame [2 x 3]
# Groups: fuel [2]
# 
#     fuel  year percent
#   (fctr) (dbl)   (dbl)
# 1   Coal  2000      40
# 2    Gas  1998      80
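
In dplyr 1.0.0 and later, top_n is superseded by slice_max. A hedged sketch of the equivalent call, assuming such a dplyr version is installed and using the df from the question (res_max is an illustrative name):

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  fuel = rep(c("Coal", "Gas"), each = 3),
  year = rep(c(1998, 1999, 2000), 2),
  percent = c(20, 30, 40, 80, 70, 60))

# One row per fuel with the largest percent.
# with_ties = FALSE returns a single row per group even if the maximum is
# shared; the default (with_ties = TRUE) behaves like top_n and keeps ties.
res_max <- df %>%
  group_by(fuel) %>%
  slice_max(percent, n = 1, with_ties = FALSE) %>%
  ungroup()

print(res_max)
```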

Or using summarise and left_join (similar result as in the two above)

df %>% 
  group_by(fuel) %>%
  summarise(percent = max(percent)) %>%
  left_join(., df)

# Joining by: c("fuel", "percent")
# Source: local data frame [2 x 3]
# 
#     fuel percent  year
#   (fctr)   (dbl) (dbl)
# 1   Coal      40  2000
# 2    Gas      80  1998
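
The join can also be avoided by letting summarise pick the year directly, which addresses the question's concern about adding an extra result column. This is a sketch rather than part of the original answer; like the sqldf query, it ignores ties (res_sum is an illustrative name):

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  fuel = rep(c("Coal", "Gas"), each = 3),
  year = rep(c(1998, 1999, 2000), 2),
  percent = c(20, 30, 40, 80, 70, 60))

# year is computed first, while percent still refers to the full
# per-group vector; percent is then reduced to its maximum.
res_sum <- df %>%
  group_by(fuel) %>%
  summarise(
    year = year[which.max(percent)],
    percent = max(percent))

print(res_sum)
```

The order of the two expressions matters: if percent were summarised first, the later which.max(percent) would see only the single summarised value.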
