Allow a maximum number of entries when certain conditions apply
Problem description
I have a dataset with a lot of entries. Each entry belongs to a certain ID (belongID) and is unique (uniqID), but multiple entries can come from the same source (sourceID). Multiple entries from the same source can also have the same belongID. For the research I need to do on this dataset, I have to drop entries wherever a single sourceID occurs more than 5 times for one belongID. The entries to keep (at most 5) are the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
It should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are many more columns of data in the file, but the selection has to be based purely on Time. As the example shows, the 5th and 6th entries of a sourceID within the same belongID can have the same Time. In that case only one of them is kept, because max = 5.
The dataset here is nicely ordered by belongID and Time for illustration, but the real dataset is not. Any idea how to tackle this problem? I have not come across anything similar yet.
Recommended answer
If dat is your data frame:
# Split by (belongID, sourceID), keep the 5 rows with the highest Time, re-stack
do.call(rbind,
        by(dat, INDICES = list(dat$belongID, dat$sourceID),
           FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))
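To check the approach against the example in the question, here is a minimal sketch that rebuilds the sample data and applies the same split/sort/head idea. The helper name keep_top and the tie-break on uniqID are my own assumptions (the question only says one of the tied rows has to go). The result is re-sorted by uniqID purely so it can be compared with the expected table.

```r
# Example data from the question (only the four columns shown there).
dat <- data.frame(
  belongID = c(rep(1, 8), rep(2, 14), rep(3, 2)),
  sourceID = c(1001, 1002, 1001, 1001, 1001, 1005, 1001, 1001,
               1005, 1006, 1005, 1006, 1005, 1006, 1005, 1006,
               1005, 1006, 1005, 1006, 1005, 1007,
               1010, 1480),
  uniqID   = 101:124,
  Time     = c(5, 5, 4, 3, 3, 2, 2, 2,
               5, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1,
               5, 2)
)

# Hypothetical helper: keep the n rows with the highest Time in one group.
# Ties on Time are broken by uniqID (an assumption, not stated in the question).
keep_top <- function(x, n = 5) {
  head(x[order(-x$Time, x$uniqID), ], n)
}

# Apply per (belongID, sourceID) group and stack the pieces back together.
res <- do.call(rbind, by(dat, list(dat$belongID, dat$sourceID), keep_top))

# by() returns the groups in its own order; sort by uniqID for comparison.
res <- res[order(res$uniqID), ]
rownames(res) <- NULL

nrow(res)                         # 20 rows remain
setdiff(dat$uniqID, res$uniqID)   # dropped: 108 119 120 121
```

The four dropped uniqIDs are exactly the rows missing from the expected table in the question. Since by() works on the grouping columns directly, the real data does not need to be sorted beforehand.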