Remove duplicates by multiple conditions
Question
I have data where an individual (Name) appears multiple times in an Eggphase category. I would like there to be only one sample per individual, but I don't just want to keep the first one that R finds. I would like to keep the one whose Group appears most often across all the other categories. Hopefully my example helps make this clear.
library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
wb fl after Kia a
wb fl after Kia c
wb wf before Kia b
wb fl before Lucy c
wb fl after Lucy b
wb fl after Lucy c
wb fl yolkdep Jess c
wb fl yolkdep Betty a
wb fl yolkdep Betty b", header = TRUE)
I would like to keep just one row per Name within each Tissue, Food and Eggphase grouping, BUT I want to select the row whose Group appears in most, if not all, of the different eggphases (with the same Tissue and Food combination).
#results I want
Tissue Food Eggphase Name Group
1 wb fl after Kia c
2 wb wf before Kia b
3 wb fl before Lucy c
4 wb fl after Lucy c
5 wb fl yolkdep Jess c
6 wb fl yolkdep Betty b
I tried:
one_bird <- myDF %>%
distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
but this only keeps the first entry:
Tissue Food Eggphase Name Group
1 wb fl after Kia a
2 wb wf before Kia b
3 wb fl before Lucy c
4 wb fl after Lucy b
5 wb fl yolkdep Jess c
6 wb fl yolkdep Betty b
Any ideas on how to tell it to select the row where Group appears in most (if not all) of the eggphases within a Tissue and Food combination?
In my example, the groups that appear the most within the Tissue and Food combination of wb and fl are c and b, but Kia doesn't appear in Group b, so c is the better option. As in this example, my data has duplicates that come from groups which are not the most common Group. How do I make it choose the next most common one just for that row?
I hope I have made enough sense.
Recommended answer
One option would be to create a frequency column grouped by 'Tissue', 'Food' and 'Group', then arrange in descending order of 'n' and use distinct to keep the first row for each Tissue, Food, Eggphase and Name combination.
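library(dplyr)
myDF %>%
  # count how many rows each Group has within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # within each Tissue/Food/Eggphase/Name, put the most frequent Group first
  arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
  # keep only the first row of each combination
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n)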
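The question asks for the Group that appears in the most eggphases, rather than the Group with the most rows. A minimal sketch of that variant follows. It assumes that counting distinct Eggphase values per Tissue, Food and Group is the intended ranking, and the column name n_phases is made up for illustration. Note that in the sample data Betty's two rows tie (groups a and b each cover two eggphases within wb/fl), so which of them is kept depends on row order unless an extra tie-breaker is added to arrange.

```r
library(dplyr)

# Sample data from the question
myDF <- read.table(text = "Tissue Food Eggphase Name Group
wb fl after Kia a
wb fl after Kia c
wb wf before Kia b
wb fl before Lucy c
wb fl after Lucy b
wb fl after Lucy c
wb fl yolkdep Jess c
wb fl yolkdep Betty a
wb fl yolkdep Betty b", header = TRUE)

result <- myDF %>%
  # n_phases (hypothetical name): number of distinct eggphases each Group
  # covers within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(n_phases = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # within each Tissue/Food/Eggphase/Name, put the widest-spread Group first
  arrange(Tissue, Food, Eggphase, Name, desc(n_phases)) %>%
  # keep one row per Tissue/Food/Eggphase/Name
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n_phases)

print(result)
```

With this data the sketch keeps Group c for both Kia and Lucy in the "after" eggphase, because c covers three eggphases in wb/fl while a and b cover two each.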