Subset based on frequency level
Question
I want to generate a df that keeps only the rows whose "ID" value occurs more often than a threshold stored in a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
library(plyr)  # provides count(), which returns a data frame with a freq column
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
Recommended answer
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests whether each df1$ID value belongs to a category with more than 9 rows. Where it does, the corresponding element of the resulting logical vector is TRUE. Passing that vector as the "i" argument makes the [ function return the entire row, because the "j" argument is empty.
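The one-liner may be easier to follow when split into steps. The sketch below rebuilds the question's data and uses intermediate names (id_counts, keep_ids, keep_rows) that are not in the original answer; they are only there for illustration.

```r
# Rebuild the example data from the question
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

# Step 1: count the rows per ID; the table's names are the ID values as strings
id_counts <- table(df1$ID)

# Step 2: keep the names of the IDs whose count exceeds the cutoff
keep_ids <- names(id_counts)[id_counts > cutoff]

# Step 3: one TRUE/FALSE per row of df1 (45 values, not one per distinct ID)
keep_rows <- df1$ID %in% keep_ids

# Step 4: logical vector in the "i" position, empty "j" position: whole rows
df2 <- df1[keep_rows, ]

nrow(df2)        # 24
unique(df2$ID)   # 4 5
```

The important point is Step 3: the logical vector has the same length as the number of rows in df1, so each row is kept or dropped according to its own ID.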
See:
?`[`
?'%in%'
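As for what the last line in the question is doing: the likely explanation is that IDfreq$freq > cutoff has only five elements, one per distinct ID, and row indexing recycles a short logical vector across all 45 rows. The sketch below reproduces that under one assumption: per_id is a stand-in for IDfreq$freq > cutoff, computed with table() so that plyr is not needed. It ends with another possible approach based on ave(), which is not part of the original answer.

```r
# Same ID layout as in the question; sub1 is just a row counter here
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = seq_along(ID))
cutoff <- 9

# Stand-in for IDfreq$freq > cutoff: one value per distinct ID
per_id <- as.vector(table(df1$ID)) > cutoff   # FALSE FALSE FALSE TRUE TRUE

# subset() recycles the 5-element vector over the 45 rows, so it keeps
# every 4th and 5th row of each block of five, whatever the ID is
recycled <- subset(df1, subset = per_id)
nrow(recycled)   # 18

# Alternative: ave() returns the group size for every row, giving a
# logical vector of length 45 that lines up with the rows of df1
per_row <- ave(df1$ID, df1$ID, FUN = length) > cutoff
df2 <- df1[per_row, ]
nrow(df2)        # 24
```

With per_row each row is tested against the size of its own ID group, which is the behaviour the question was after.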