Subsetting based on the number of observations in a factor variable


Problem description

How do you subset based on the number of observations in each level of a factor variable? I have a dataset with 1,000,000 rows and nearly 3000 levels, and I want to drop the levels that have fewer than, say, 200 observations.

data <- read.csv("~/Dropbox/Shared/data.csv", sep=";")

summary(as.factor(data$factor))
10001 10002 10003 10004 10005 10006 10007 10009 10010 10011 10012 10013 10014 10016 10017 10018 10019 10020
  414   741  2202   205   159   591   194   678   581   774   778   738  1133   997   381   157   522     6 
10021 10022 10023 10024 10025 10026 10027 10028 10029 10030 10031 10032 10033 10034 10035 10036 10037 10038 
  398   416  1236   797   943   386   446   542   508   309   452   482   425   272   261   291   145   598 
10039 10040 10041 10043 10044 10065 10069 10075 10080 10104 10105 10106 10110 10112 10115 10117 10119 10121 
  119   263     9     9   179   390    70   465    19     3     7     5     4     1     1     1     2     6 
10123 10128 10150 10152 10154 10155 10168 10170 10173 10174 10176 10199 10210 10220 10240 10265 10270 10271 
    2   611     8     1     1     2    10     1     6     5     5     2     5     2     1     3     5     2 

As the summary above shows, some levels have only a few observations, and I want to remove the levels that have fewer than 100.

I tried the following, but it didn't work:

for (n in unique((data$factor))) {
    m<-subset(data, factor==n)
    o<-length(m[,1])
    data<-ifelse( o<100, subset(data, factor!=n), data)
}

Recommended answer

Use table() to count the observations per level, subset that table, and match on the names of the subset. You will probably want droplevels() afterwards.
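The three steps can be traced on a toy vector. This is only a sketch: the values, the names grp, counts, keep and kept, and the threshold of 2 are made up for illustration.

```r
# Toy data (made up): level "a" occurs 3 times, "b" once, "c" twice.
grp <- factor(c("a", "b", "a", "c", "a", "c"))

counts <- table(grp)                  # number of observations per level
keep <- names(counts)[counts >= 2]    # names of levels meeting the threshold
kept <- droplevels(grp[grp %in% keep])  # match on names, then drop unused levels

print(keep)
print(levels(kept))
```

Without droplevels(), "b" would remain as an empty level and still appear in later calls to table().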

EDIT

Some example data:

set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000, 
  TRUE, prob=rexp(3000))))

Some categories have very few cases:

> min(table(data$factor))
[1] 1

Remove the records whose value of factor occurs fewer than 100 times:

tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])

Check:

> min(table(data$factor))
[1] 100
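A different route to the same result, not the one used above, is to attach each row's group size with ave() and filter on that directly. A minimal sketch, assuming a data frame df with a factor column grp (both names and the data are hypothetical):

```r
# Hypothetical data: "a" has 300 rows, "b" has 150, "c" only 5.
df <- data.frame(grp = factor(rep(c("a", "b", "c"), times = c(300, 150, 5))))

# ave() applies length() within each level and returns one value per row,
# so n_per_row[i] is the size of the group that row i belongs to.
n_per_row <- ave(seq_len(nrow(df)), df$grp, FUN = length)

# Keep only rows from groups with at least 100 observations.
df_big <- droplevels(df[n_per_row >= 100, , drop = FALSE])

print(table(df_big$grp))
```

This avoids the name matching step, at the cost of building a count vector as long as the data.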

Note that data and factor are not very good names, since they are also built-in functions.
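As for the loop in the question, one likely reason it fails is that ifelse() is vectorised and returns a value shaped like its test, which here has length one, so it cannot hand back a whole data frame. A plain if statement can. The sketch below uses the illustrative names dat and grp, which also avoids reusing the names of data() and factor(); the data is made up.

```r
# Illustrative data: "a" has 150 rows, "b" has 120, "c" only 3.
dat <- data.frame(grp = factor(rep(c("a", "b", "c"), times = c(150, 120, 3))))

for (lev in levels(dat$grp)) {
  n_obs <- sum(dat$grp == lev)   # rows belonging to this level
  # if (not ifelse): either replace the whole data frame or leave it alone
  if (n_obs < 100) {
    dat <- dat[dat$grp != lev, , drop = FALSE]
  }
}
dat <- droplevels(dat)

print(levels(dat$grp))
```

With thousands of levels and a million rows this rescans the data once per level, so the table() approach above is likely to be much faster.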
