Dplyr: applying a calculation on values in a grouping, comparing each item to all _other_ items in the group

Problem description

I want to work out whether a value in a grouping is different enough from the other values in that grouping. Specifically, I want to work out whether the end time of a lesson matches the start time of another lesson on the same day for the same student. Using diamonds, this is the equivalent code:

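library(ggplot2)   # provides the diamonds data set
library(dplyr)     # group_by(), mutate(), filter(), arrange()
library(magrittr)  # provides the %$% operator used below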
diamonds %>% group_by(color, cut) %>% 
  mutate(clash = sum(
           lapply(
             diamonds %>% 
               filter(color == color, cut == cut, carat != carat) %$% carat,
             function(x) ifelse(x < carat - 0.01 && x > carat + 0.01, 1, 0)))) %>%
  arrange(color, cut, clash)

The plan is that if clash is over 1, I know that another diamond in that grouping is very close in carat size to this one. Instead, the code gives me the following error:

Error in sum(sapply(diamonds %>% filter(color == color, cut == cut, carat !=  : 
  invalid 'type' (list) of argument

This makes the second call to diamonds look dodgy.
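The error message itself points at a simpler cause: lapply() returns a list, and sum() does not accept a list. A minimal base R sketch reproduces this on made-up values and shows vapply() as one possible way to get a plain vector instead. The vector carats and the 0.3 threshold are illustrative assumptions, not part of the question.

```r
# Illustrative values only; not taken from the diamonds data.
carats <- c(0.30, 0.31, 0.50)

# lapply() always returns a list, one element per input value.
as_list <- lapply(carats, function(x) ifelse(x > 0.3, 1, 0))

# sum() cannot add up a list; capture the error message instead of stopping.
msg <- tryCatch(sum(as_list), error = function(e) conditionMessage(e))
print(msg)

# vapply() returns an atomic vector of the declared type, which sum() accepts.
as_vector <- vapply(carats, function(x) ifelse(x > 0.3, 1, 0), numeric(1))
total <- sum(as_vector)
print(total)
```

Two other details in the question's code are worth checking. Inside filter(), color == color compares the column with itself, so it is always TRUE, and carat != carat is always FALSE. The condition x < carat - 0.01 && x > carat + 0.01 can never hold for a single number.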

Recommended answer

You can use pmap instead of lapply, which fits better within the tidyverse:

library(tidyverse)

myfun <- function(.color, .cut, .carat){
 diamonds %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    nrow()
}

diamonds %>% 
  mutate(clash = pmap_int(list(color, cut, carat), myfun)) %>%
  arrange(color, cut, clash)

# A tibble: 53,940 x 11
   carat cut   color clarity depth table price     x     y     z clash
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
 1  1.01 Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05   124
 2  1.01 Fair  D     SI2      64.7    57  3871  6.31  6.27  4.07   124
 3  1.01 Fair  D     SI1      66.3    55  4118  6.22  6.17  4.11   124
 4  1.01 Fair  D     SI2      65.3    55  4205  6.33  6.19  4.09   124
 5  1.01 Fair  D     SI1      65.9    60  4276  6.32  6.18  4.12   124
 6  1.01 Fair  D     SI2      64.6    62  4538  6.26  6.21  4.03   124
 7  1.01 Fair  D     SI1      63.5    58  4751  6.35  6.25  4      124
 8  1.01 Fair  D     SI1      64.6    60  4751  6.12  6.08  3.94   124
 9  1.01 Fair  D     SI1      66.9    54  4751  6.25  6.21  4.17   124
10  1.01 Fair  D     SI1      66.2    56  5122  6.05  6.1   4.02   124

Note that this solution works but is not very efficient, because myfun filters the whole diamonds table once for every row. You can easily modify the code to work on one row per distinct combination of color, carat and cut instead:

diamonds2 <- diamonds %>%
  count(color, carat, cut)

myfun2 <- function(.color, .cut, .carat){
  diamonds2 %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

diamonds2 %>% 
  mutate(clash = pmap_int(list(color, cut, carat), myfun2)) %>%
  left_join(diamonds, ., by = c("color", "carat", "cut")) %>%
  arrange(color, cut, clash)

The result is the same, but the second version (using myfun2) is much faster.
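As written, myfun and myfun2 use !between(), so they count the rows in the group that lie outside the ±0.01 window. If the goal is instead to count the other rows that lie inside the window, which is an assumption about what the question intended, a grouped mutate() with a small helper is one possible sketch. The helper count_close, the tol argument and the toy data are hypothetical names and values introduced only for illustration.

```r
library(dplyr)

# Hypothetical helper: for each value, count how many OTHER values in the
# same vector lie within `tol` of it (the value's own position is excluded).
count_close <- function(x, tol = 0.01) {
  vapply(
    seq_along(x),
    function(i) sum(abs(x[-i] - x[i]) <= tol),
    integer(1)
  )
}

# Toy data standing in for diamonds; the values are made up.
toy <- tibble(
  grp   = c("a", "a", "a", "b", "b"),
  carat = c(1.000, 1.005, 1.500, 0.300, 0.900)
)

# Within each group, compare every carat with all other carats in that group.
result <- toy %>%
  group_by(grp) %>%
  mutate(clash = count_close(carat)) %>%
  ungroup()

print(result)
```

The helper compares every pair within a group, so its cost grows with the square of the group size. Differences that are exactly equal to tol can fall on either side of the threshold because of floating-point rounding, so a slightly larger tolerance may be safer in practice.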

The following example also uses clarity in the filter:

diamonds3 <- diamonds %>%
  count(color, carat, cut, clarity)


myfun3 <- function(.color, .cut, .carat, .clarity){
  diamonds3 %>%
    filter(color == .color, cut == .cut, clarity == .clarity, 
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

myfun3(.color = "D", .cut = "Fair", .clarity = "I1", .carat = 1.5)
[1] 3
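To apply myfun3 to every row rather than to a single combination, the same pmap_int() and left_join() pattern as in the myfun2 version should carry over. The sketch below assumes this and, purely to keep the run time short, restricts the data to one color and cut; the names small, clashes and result3 are introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the diamonds data set

# Assumption: restrict to one color/cut so the sketch runs quickly.
small <- diamonds %>% filter(color == "D", cut == "Fair")

# One row per distinct combination, with n = number of diamonds in it.
diamonds3 <- small %>% count(color, carat, cut, clarity)

myfun3 <- function(.color, .cut, .carat, .clarity) {
  diamonds3 %>%
    filter(color == .color, cut == .cut, clarity == .clarity,
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>%
    sum()
}

# The list is unnamed, so its order must match myfun3's argument order.
clashes <- diamonds3 %>%
  mutate(clash = pmap_int(list(color, cut, carat, clarity), myfun3))

# Attach the per-combination result back to the individual diamonds.
result3 <- small %>%
  left_join(clashes, by = c("color", "carat", "cut", "clarity")) %>%
  arrange(clarity, clash)

print(result3)
```

Because the list passed to pmap_int() is unnamed, swapping carat and clarity in it would silently pass the wrong values to myfun3; naming the list elements .color, .cut, .carat and .clarity is one way to guard against that.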