Dplyr: applying a calculation on values in a grouping, comparing each item to all _other_ items in the group
Problem description
I want to work out whether a value in a grouping is different enough from the other values in that grouping. Specifically, I want to work out whether the end time of one lesson matches the start time of another lesson on the same day for the same student. Using diamonds, this is the equivalent code:
library(ggplot2)   # provides the diamonds data set
library(dplyr)
library(magrittr)  # provides %$%

diamonds %>% group_by(color, cut) %>%
  mutate(clash = sum(
    lapply(
      diamonds %>%
        filter(color == color, cut == cut, carat != carat) %$% carat,
      function(x) ifelse(x < carat - 0.01 && x > carat + 0.01, 1, 0)))) %>%
  arrange(color, cut, clash)
The plan is that if clash is over 1, I know that another diamond in that grouping is very close in carat size to this diamond. This gives me the following error:
Error in sum(sapply(diamonds %>% filter(color == color, cut == cut, carat != :
invalid 'type' (list) of argument
This makes the second call to diamonds look dodgy.
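As a minimal sketch of what is likely going on (the toy data frame and variable names below are made up for illustration), the error message comes from calling sum() on a list, and the inner filter() cannot see the "current" row, because inside filter() both sides of a comparison such as carat != carat refer to the same column:

```r
library(dplyr)

# lapply() always returns a list, and sum() cannot add up a list
res <- lapply(c(1, 2, 3), function(x) x * 2)
err <- tryCatch(sum(res), error = function(e) conditionMessage(e))
# err now holds the same "invalid 'type' (list) of argument" message

# Flattening the list to a numeric vector first makes sum() work
total <- sum(unlist(res))

# Inside filter(), `carat != carat` compares the column with itself,
# so the condition is FALSE for every row and no rows are kept
toy <- tibble(carat = c(0.5, 0.7, 1.0))  # hypothetical stand-in for diamonds
n_kept <- toy %>% filter(carat != carat) %>% nrow()
```

This is why the value of the current row has to be passed in under a different name, as a function argument for instance.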
Recommended answer
You can use pmap instead of lapply, which fits better inside the tidyverse:
library(tidyverse)

# Count the diamonds of the same color and cut whose carat lies
# outside the interval [.carat - 0.01, .carat + 0.01]
myfun <- function(.color, .cut, .carat) {
  diamonds %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    nrow()
}

diamonds %>%
  mutate(clash = pmap_int(list(color, cut, carat), myfun)) %>%
  arrange(color, cut, clash)
# A tibble: 53,940 x 11
   carat cut   color clarity depth table price     x     y     z clash
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
 1  1.01 Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05   124
 2  1.01 Fair  D     SI2      64.7    57  3871  6.31  6.27  4.07   124
 3  1.01 Fair  D     SI1      66.3    55  4118  6.22  6.17  4.11   124
 4  1.01 Fair  D     SI2      65.3    55  4205  6.33  6.19  4.09   124
 5  1.01 Fair  D     SI1      65.9    60  4276  6.32  6.18  4.12   124
 6  1.01 Fair  D     SI2      64.6    62  4538  6.26  6.21  4.03   124
 7  1.01 Fair  D     SI1      63.5    58  4751  6.35  6.25  4      124
 8  1.01 Fair  D     SI1      64.6    60  4751  6.12  6.08  3.94   124
 9  1.01 Fair  D     SI1      66.9    54  4751  6.25  6.21  4.17   124
10  1.01 Fair  D     SI1      66.2    56  5122  6.05  6.1   4.02   124
Note that this solution works but is not very efficient, because it filters the full data set once for every row. You can easily modify this code to operate groupwise, counting each distinct combination of color, carat and cut only once:
# One row per distinct (color, carat, cut) combination, with its count in n
diamonds2 <- diamonds %>%
  count(color, carat, cut)

myfun2 <- function(.color, .cut, .carat) {
  diamonds2 %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

diamonds2 %>%
  mutate(clash = pmap_int(list(color, cut, carat), myfun2)) %>%
  left_join(diamonds, ., by = c("color", "carat", "cut")) %>%
  arrange(color, cut, clash)
The result is the same, but the second version (using myfun2) is much faster.
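To see that the two approaches agree, here is a small sketch on made-up data. The tibble toy and the helper names count_rowwise and count_grouped are hypothetical, chosen only for this illustration; the logic mirrors the row-wise and the count-based versions described above.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for diamonds
toy <- tibble(
  color = c("D", "D", "D", "D", "E", "E"),
  cut   = c("Fair", "Fair", "Fair", "Fair", "Fair", "Fair"),
  carat = c(1.00, 1.00, 1.20, 1.50, 0.70, 0.90)
)

# Row-wise version: filters the full table once per row
count_rowwise <- function(.color, .cut, .carat) {
  toy %>%
    filter(color == .color, cut == .cut,
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    nrow()
}

# Count-based version: works on one row per distinct combination
toy_counts <- toy %>% count(color, carat, cut)

count_grouped <- function(.color, .cut, .carat) {
  toy_counts %>%
    filter(color == .color, cut == .cut,
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>%
    sum()
}

slow <- toy %>%
  mutate(clash = pmap_int(list(color, cut, carat), count_rowwise))

fast <- toy_counts %>%
  mutate(clash = pmap_int(list(color, cut, carat), count_grouped)) %>%
  left_join(toy, ., by = c("color", "carat", "cut")) %>%
  select(color, cut, carat, clash)

# Both versions assign the same clash value to every row
same <- identical(slow$clash, fast$clash)
```

The count-based version calls the helper once per distinct combination rather than once per row, which is presumably where the speed-up on the full data set comes from.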
For an example that also filters on clarity, see:
diamonds3 <- diamonds %>%
  count(color, carat, cut, clarity)

myfun3 <- function(.color, .cut, .carat, .clarity) {
  diamonds3 %>%
    filter(color == .color, cut == .cut, clarity == .clarity,
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

# Named arguments use `=`, not `==`
myfun3(.color = "D", .cut = "Fair", .clarity = "I1", .carat = 1.5)
[1] 3
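The functions above count diamonds whose carat lies outside the 0.01 window. If the goal is instead to count the other diamonds in the same group that lie within the window (one reading of "very close in carat size" in the question), a grouped mutate can do this without a helper that filters the whole table. The following is only a sketch under that assumption; the toy data and the column name n_close are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for diamonds
toy <- tibble(
  color = c("D", "D", "D", "E", "E"),
  cut   = c("Fair", "Fair", "Fair", "Ideal", "Ideal"),
  carat = c(1.000, 1.005, 1.200, 0.700, 0.900)
)

result <- toy %>%
  group_by(color, cut) %>%
  mutate(
    # For each carat x, count the group members within 0.01 of x,
    # then subtract 1 so the row does not count itself
    n_close = vapply(
      carat,
      function(x) sum(abs(carat - x) <= 0.01) - 1L,
      integer(1)
    )
  ) %>%
  ungroup()
```

Inside the grouped mutate(), carat refers only to the current group's values, so each row is compared with the other rows of its own color and cut. Values that sit exactly on the 0.01 boundary may fall on either side because of floating-point rounding.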