Allow a maximum number of entries when certain conditions apply


Problem description

I have a dataset with many entries. Each entry belongs to a certain ID (belongID). The entries are unique (identified by uniqID), but multiple entries can come from the same source (sourceID), and multiple entries from the same source can share the same belongID. For the research I need to do on this dataset, no single sourceID may contribute more than 5 entries to one belongID, so the excess entries have to be removed. The (at most) 5 entries to keep are the ones with the highest 'Time' value.

To illustrate this, I have the following example dataset:

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2       
   1           1001     108       2       
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3       
   2           1005     119       2        
   2           1006     120       2        
   2           1005     121       1      
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2  

The result should look like this:

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2           
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3           
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2     

The file contains many more columns, but the selection has to be based purely on Time. As the example shows, the 5th and 6th entries of a sourceID within the same belongID can have the same Time. In that case only one of them has to be kept, because the maximum is 5.

The dataset here is neatly ordered by belongID and Time for illustrative purposes, but the real dataset is not. Any idea how to tackle this problem? I have not come across anything similar yet.

Recommended answer

If dat is your data frame:

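# Split dat by each (belongID, sourceID) pair, sort each piece by Time
# (highest first), keep at most 5 rows, then bind the pieces back together.
do.call(rbind, 
        by(dat, INDICES=list(dat$belongID, dat$sourceID), 
           FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))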
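
To try the answer on the question's example, the sketch below rebuilds the sample table as a data frame and applies the same by/head idiom. The name res and the final re-sorting by uniqID are additions made here for readability and are not part of the original answer. Which of two entries tied on Time survives the cut depends on how order() breaks ties, so the sketch does not rely on a particular one being kept.

```r
# Rebuild the example dataset from the question (24 rows).
dat <- data.frame(
  belongID = c(rep(1, 8), rep(2, 14), rep(3, 2)),
  sourceID = c(1001, 1002, 1001, 1001, 1001, 1005, 1001, 1001,
               1005, 1006, 1005, 1006, 1005, 1006, 1005, 1006,
               1005, 1006, 1005, 1006, 1005, 1007,
               1010, 1480),
  uniqID   = 101:124,
  Time     = c(5, 5, 4, 3, 3, 2, 2, 2,
               5, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1,
               5, 2)
)

# Same idiom as the answer: per (belongID, sourceID) pair, sort by Time
# descending and keep at most the first 5 rows.
res <- do.call(rbind,
               by(dat, INDICES = list(dat$belongID, dat$sourceID),
                  FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))

# Added here for readability: restore the uniqID order and drop the
# row names that rbind generates.
res <- res[order(res$uniqID), ]
rownames(res) <- NULL

print(res)

# No (belongID, sourceID) pair should have more than 5 rows left.
print(table(res$belongID, res$sourceID))
```

Because by() does the splitting itself, the input does not need to be sorted beforehand, which matches the remark in the question that the real dataset is unordered.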
