How to use conditional filtering of a data frame in R when trying to retain non-duplicated values in two columns
Question
I have a data frame organized like this:
df <- data.frame(ID=c(rep("1111", 16),rep("2222", 16)),
subID=rep(c(rep("100", 4), rep("200", 4), rep("300", 4), rep("400", 4)),2),
instance=rep(1:4, 8),
feature=rep(letters[1:4], 8)
)
It looks like this:
> df
ID subID instance feature
1 1111 100 1 a
2 1111 100 2 b
3 1111 100 3 c
4 1111 100 4 d
5 1111 200 1 a
6 1111 200 2 b
7 1111 200 3 c
8 1111 200 4 d
9 1111 300 1 a
10 1111 300 2 b
11 1111 300 3 c
12 1111 300 4 d
13 1111 400 1 a
14 1111 400 2 b
15 1111 400 3 c
16 1111 400 4 d
17 2222 100 1 a
18 2222 100 2 b
19 2222 100 3 c
20 2222 100 4 d
21 2222 200 1 a
22 2222 200 2 b
23 2222 200 3 c
24 2222 200 4 d
25 2222 300 1 a
26 2222 300 2 b
27 2222 300 3 c
28 2222 300 4 d
29 2222 400 1 a
30 2222 400 2 b
31 2222 400 3 c
32 2222 400 4 d
In the real data set, all subIDs are unique samples collected from the same ID. You can think of them as samples collected at four time points from the same location. Each of the subIDs 100 through 400 is associated with exactly one of the 4 instances (e.g., 100 = 2, 200 = 4, 300 = 3, and 400 = 1), and each subID is unique within its ID. However, I do not know the actual linkage and will need to do a manual record review to assign it. To make my review quicker, I want to retain one row for each subID and one for each instance, like so:
ID subID instance feature truesubID
1 1111 100 1 a
2 1111 200 2 b
3 1111 300 3 c
4 1111 400 4 d
5 2222 100 1 a
6 2222 200 2 b
7 2222 300 3 c
8 2222 400 4 d
This way, when I do the manual record review, I know what the possible subID numbers are, which ID they belong to, and how many instances to cross-reference. I will then fill in the true subID in the last column (e.g., subID=100 is really instance=4 for ID=1111, etc.).
Do you know how I could filter the first df to look like the second?
Thanks!
Recommended answer
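There is a pattern in your data frame. Within each ID, the rows you want are rows 1, 6, 11 and 16, so you can keep every fifth row (starting from the first) to get your desired result. The step is 5 because each subID has four instances, and moving one subID down and one instance along skips 4 + 1 rows: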
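library(dplyr)

# Within each ID, keep rows 1, 6, 11, 16: one row per subID, each with a different instance
df1 <- df %>%
  group_by(ID) %>%
  slice(which(row_number() %% 5 == 1))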
This returns the eight rows shown in the desired output above (without the empty truesubID column).
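As a minimal sketch of the same idea, the hard-coded 5 can be derived from the data as "number of distinct instances plus one". This assumes, as in the example data, that within each ID the rows are sorted by subID and then instance, and that every subID has the same number of instances as there are subIDs. The name `review` and the `NA` placeholder for truesubID are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ID = c(rep("1111", 16), rep("2222", 16)),
  subID = rep(rep(c("100", "200", "300", "400"), each = 4), 2),
  instance = rep(1:4, 8),
  feature = rep(letters[1:4], 8)
)

review <- df %>%
  group_by(ID) %>%
  # Step = instances per subID + 1 (5 for the example data)
  filter(row_number() %% (n_distinct(instance) + 1) == 1) %>%
  ungroup() %>%
  # Empty column to fill in by hand during the record review (assumed placeholder)
  mutate(truesubID = NA_character_)

print(review)
```

If the rows are not sorted this way, arranging by ID, subID and instance first would be needed before the modulo filter applies.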
Update, following the additional information: a solution for when each ID has a variable number of instances:
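# Requires dplyr (>= 1.0.0 for cur_group_id()) and purrr
# Split by ID, then take the n-th row of the n-th subID group within each ID
df1 <- df %>%
  group_split(ID) %>%
  purrr::map_df(~ .x %>%
    group_by(subID) %>%
    slice(cur_group_id()))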
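To illustrate the variable-size case, here is a sketch on made-up data where ID 1111 has three subIDs and ID 2222 has four. It swaps in a different technique from the answer above: instead of splitting and slicing, it keeps the rows where the rank of the subID equals the rank of the instance within each ID. The helper `make_id()` and the data frame `df_uneven` are hypothetical names introduced only for this example, and the sketch assumes each ID has as many instances as subIDs.

```r
library(dplyr)

# Hypothetical helper: one ID with k subIDs (100, 200, ...) and k instances each
make_id <- function(id, k) {
  data.frame(
    ID = id,
    subID = rep(as.character(seq_len(k) * 100), each = k),
    instance = rep(seq_len(k), times = k),
    feature = rep(letters[seq_len(k)], times = k)
  )
}

# ID 1111 has 3 subIDs, ID 2222 has 4
df_uneven <- rbind(make_id("1111", 3), make_id("2222", 4))

review <- df_uneven %>%
  group_by(ID) %>%
  # Pair the n-th subID with the n-th instance within each ID
  filter(dense_rank(subID) == dense_rank(instance)) %>%
  ungroup()

print(review)
```

Because the comparison uses ranks rather than row positions, this version does not depend on the row order of the input.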