选择分组数据框中具有公共ID的行 [英] Select rows with common ids in grouped data frame

查看:73
本文介绍了选择分组数据框中具有公共ID的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找以下问题的更简单解决方案。这是我的设置:

I am searching for a simpler solution to the following problem. Here is my setup:

test <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_3",     5,   "g",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_3",     8,   "k",
     "groupC",   "id_4",     9,   "l",
     "groupC",   "id_5",     0,   "o",
     "groupC",   "id_6",    12,   "u"
  )

我只想选择<$ c中的那些元素$ c> id_name 是 group_name 中所有组所共有的-即删除所有组中都不存在的id的行。我的实际数据很大(200k行),介于4-20个组之间(我事先不知道组的数量,因此该解决方案必须适用于任意数量的组)。每个组中的 id_name 不是唯一的。期望的结果将是:

I want to select only those elements in id_name that are common to all groups in group_name - i.e. drop the rows for ids that are not present in all the groups. My actual data is large (200k rows) with anywhere between 4-20 groups (I don't know the number of groups beforehand so the solution must work for any number of groups). The id_name in each group is NOT unique. The desired result would be:

test_result <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_4",     9,   "l",
  )

(删除至少一组中没有ID的行)。理想情况下,我不希望我的输出在末尾连接各列。我想简单地删除任何一组中缺少的行,但保持数据框的形状。

(the rows with ids absent in at least one group are dropped). Ideally I do not want my output to have the columns joined at the end. I want "simply" to drop the rows missing in any one group but maintain the shape of the dataframe.

我知道我可以从每组中提取所有ID ,然后将它们全部相交以获得所有组中存在的唯一ID列表,然后仅针对这些ID过滤主数据框。但这听起来像是很多工作;-)

And I know that I can extract all the ids from each group, then intersect them all to obtain the list of unique ids present in all groups and then filter the main dataframe for just these IDs. But that sounds like a lot of work ;-)

任何提示将不胜感激。

推荐答案

在基数R中,我们可以通过 分割 id_name group_name 找到常见的 id's ,然后找到子集

In base R, we can split id_name by group_name find common id's and then subset

subset(test, id_name %in% Reduce(intersect, split(id_name, group_name)))

#   group_name id_name  varA varB 
#   <chr>      <chr>   <dbl> <chr>
# 1 groupA     id_1        1 a    
# 2 groupA     id_2        4 f    
# 3 groupA     id_4        6 x    
# 4 groupA     id_4        6 h    
# 5 groupB     id_1        2 s    
# 6 groupB     id_2       13 y    
# 7 groupB     id_4       14 t    
# 8 groupC     id_1        3 d    
# 9 groupC     id_2        7 j    
#10 groupC     id_4        9 l    






中使用类似的概念tidyverse ,应该是

library(tidyverse)
test %>%
  filter(id_name %in% (test %>%
                         group_split(group_name)  %>%
                         map(~pull(., id_name)) %>%
                         reduce(intersect)))

这篇关于选择分组数据框中具有公共ID的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆