Calculation on every pair from grouped data.frame


Problem description

My question is about performing a calculation between each pair of groups in a data.frame. I'd like it to be more vectorized.

I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.

I can accomplish this for one pair of locations as follows:

library(dplyr)  # sample_n, filter, mutate, group_by, top_n, %>%
library(tidyr)  # gather

df0 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
                 Sample = rep(c(1:30), times =3),
                 Var1 = sample(1:25, 90, replace =T),
                 Var2 = sample(1:25, 90, replace=T))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each =30), 
                 Sample = rep(c(31:60), times =3),
                 Var1 = sample(1:100, 90, replace =T),
                 Var2 = sample(1:100, 90, replace=T))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100) # data

dfl <- df %>% gather(VAR, value, 3:4)

df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)

I tried other possibilities such as using tidyr::spread, but could not avoid either the "Error: Duplicate identifiers for rows" or columns half filled with NA.

Is there a cleaner, more automated way to do this for each possible pair of groups? I'd like to avoid the manual subset-and-merge routine for each pair.

Recommended answer

One option is to create the pairwise combinations of 'Location' with combn and then apply the remaining steps from the OP's code to each pair:

library(tidyverse)
df %>% 
    # get the unique elements of Location
    distinct(Location) %>% 
    # pull the column as a vector
    pull %>% 
    # it may be a factor, so convert it to character
    as.character %>% 
    # get the pairwise combinations in a list
    combn(m = 2, simplify = FALSE) %>%
    # loop through the list with map and do the full_join
    # with the long format data dfl (from the OP's gather step)
    map(~ full_join(dfl %>% 
                      filter(Location == first(.x)), 
                    dfl %>% 
                      filter(Location == last(.x)), by = "VAR") %>% 
             # create a column of absolute difference
             mutate(DIFF = abs(value.x - value.y)) %>%
             # grouped by VAR, Sample.x
             group_by(VAR, Sample.x) %>%
             # keep the row(s) with the smallest DIFF in each group
             top_n(-1, DIFF))
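To make the combn step concrete, here is a minimal base R sketch of what it produces for three location labels. The `locations` vector and the naming scheme for the pairs are assumptions for illustration only, not part of the original answer.

```r
# Three location labels, standing in for the unique values of df$Location
locations <- c("A", "B", "C")

# simplify = FALSE returns a list with one character vector per pair
pairs <- combn(locations, m = 2, simplify = FALSE)
str(pairs)

# Optional (illustrative): name each element after its pair, e.g. "A_B"
names(pairs) <- vapply(pairs, paste, character(1), collapse = "_")
names(pairs)
```

Naming the list this way is handy because purrr::map keeps the names of its input, so each result can be looked up by its pair.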






Also, since the OP asked about picking up both locations automatically instead of filtering twice, both locations of a pair can be filtered in one step and the data spread to wide format (the expected output is not entirely clear, though):

df %>% 
   distinct(Location) %>%
   pull %>%
   as.character %>% 
   combn(m = 2, simplify = FALSE) %>% 
   map(~ dfl %>% 
             # change here i.e. filter both the Locations
             filter(Location %in% .x) %>% 
             # spread it to wide format
             spread(Location, value, fill = 0) %>% 
             # create the DIFF column by taking the difference
             mutate(DIFF = abs(!! rlang::sym(first(.x)) - 
                              !! rlang::sym(last(.x)))) %>% 
             group_by(VAR, Sample) %>% 
             top_n(-1, DIFF))
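Since the question asks for something more vectorized, here is a rough base R sketch of the same idea. It uses outer() to compute all absolute differences for a pair of locations in one call. The helper `closest_match`, the `toy` data, and the "A_B" naming are hypothetical. Unlike top_n, which keeps every tied row, which.min keeps only the first closest match. The sketch assumes both locations have at least one row for each VAR.

```r
# Hypothetical helper: for each row of x, find the row of y with the nearest value
closest_match <- function(x, y) {
  stopifnot(nrow(x) > 0, nrow(y) > 0)
  # Matrix of |x_i - y_j| for all i, j
  d <- abs(outer(x$value, y$value, "-"))
  # Column index of the smallest difference in each row (first one on ties)
  j <- apply(d, 1, which.min)
  data.frame(Sample.x = x$Sample, value.x = x$value,
             Sample.y = y$Sample[j], value.y = y$value[j],
             DIFF = d[cbind(seq_along(j), j)])
}

# Small made-up data set in the same long format as dfl
toy <- data.frame(
  Location = c("A", "A", "B", "B", "C", "C"),
  Sample   = c(1, 2, 1, 2, 1, 2),
  VAR      = "Var1",
  value    = c(10, 20, 18, 11, 30, 12),
  stringsAsFactors = FALSE
)

pairs <- combn(unique(toy$Location), m = 2, simplify = FALSE)
names(pairs) <- vapply(pairs, paste, character(1), collapse = "_")

# One result per pair; within a pair, one block per VAR
result <- lapply(pairs, function(p) {
  per_var <- lapply(split(toy, toy$VAR), function(d) {
    res <- closest_match(d[d$Location == p[1], ], d[d$Location == p[2], ])
    res$VAR <- d$VAR[1]
    res
  })
  do.call(rbind, per_var)
})

result$A_B
```

Note that outer() builds an n-by-m matrix, so memory use grows with the product of the two group sizes. That is fine for groups of this size but may not scale to very large ones.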
