Remove duplicates by multiple conditions


Question

I have data in which an individual (Name) appears multiple times within an Eggphase category. I would like only one sample per individual, but I don't just want to keep the first one that R finds. I would like to keep the one whose Group appears most often across all the other categories. Hopefully my example makes this clear.

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

I would like to keep just one row per Name within each combination of Tissue, Food and Eggphase, BUT I want to select the row whose Group appears in most, if not all, of the different egg phases (with the same Tissue and Food combination).

   #results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried:

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but this only keeps the first entry:

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

Any ideas on how to tell it to select the row whose Group appears in most (if not all) of the egg phases within a Tissue and Food combination? In my example, the groups that appear the most within the Tissue and Food combination of wb and fl are c and b, but Kia doesn't appear in Group b, so c is the better option. As in this example, my data has duplicates that come from groups other than the most common Group. How do I make it choose the next most common Group just for that row?

I hope I have made enough sense.

Recommended answer

One option would be to create a frequency column grouped by 'Tissue', 'Food' and 'Group', then arrange in descending order of 'n' and use distinct(), which keeps the first row of each Tissue, Food, Eggphase and Name combination.

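library(dplyr)
myDF %>%
     # count how many rows each Group has within a Tissue/Food combination
     group_by(Tissue, Food, Group) %>%
     mutate(n = n()) %>%
     ungroup() %>%
     # within each Tissue/Food/Eggphase/Name, put the most frequent Group first
     arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
     # keep only the first row per Tissue/Food/Eggphase/Name
     distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
     select(-n)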
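
The answer above ranks each Group by its number of rows within a Tissue and Food combination. The question asks for the Group that appears in the most egg phases, which can differ from the row count when a Group has several rows in the same phase. A minimal sketch of that variant follows, assuming dplyr 1.0.0 or later for slice_max(); the column name n_phases is made up for illustration and is not part of the original answer.

```r
library(dplyr)

# Same sample data as in the question.
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

one_bird <- myDF %>%
  # For each Tissue/Food/Group, count the distinct egg phases it covers
  # (n_phases is a hypothetical helper column).
  group_by(Tissue, Food, Group) %>%
  mutate(n_phases = n_distinct(Eggphase)) %>%
  # Within each individual and egg phase, keep the single row whose
  # Group covers the most phases.
  group_by(Tissue, Food, Eggphase, Name) %>%
  slice_max(n_phases, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n_phases)

one_bird
```

With the sample data, Group c covers three egg phases within wb and fl, so it is kept for both Kia and Lucy in the after phase. Betty's two rows (Groups a and b) each cover two phases, so the counts alone cannot decide between them; with_ties = FALSE still returns one row, but getting b as in the desired output would need an extra tie-breaking rule that the question does not specify.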
