R data.table - sample by group with different sampling proportion


Problem description

I would like to efficiently draw a random sample by group from a data.table, with the option of sampling a different proportion from each group.

If I wanted to sample the same fraction, sampling_fraction, from every group, I could take inspiration from this question and its related answer and do something like:

library(data.table)

# example data: grouping column a (values 1 and 2) and random values in b
DT = data.table(a = sample(1:2), b = sample(1:1000,20))

group_sampler <- function(data, group_col, sample_fraction){
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}

# what % of data should be sampled
sampling_fraction = 0.5

# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)
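As a quick sanity check on the single-fraction case, the sketch below compares the number of rows kept per group with ceiling(.N * sampling_fraction). The fixed seed, the rep()-based example data and the `check` table are assumptions added for illustration only; the sampling line itself is the same expression used inside group_sampler().

```r
library(data.table)

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible
DT <- data.table(a = rep(1:2, times = 10), b = sample(1:1000, 20))

# same sampling expression as in group_sampler(), written inline
sampling_fraction <- 0.5
sampled_dt <- DT[, .SD[sample(.N, ceiling(.N * sampling_fraction))], by = a]

# compare rows kept per group with the original group sizes
check <- merge(DT[, .(n_total = .N), by = a],
               sampled_dt[, .(n_sampled = .N), by = a],
               by = "a")
check[, expected := ceiling(n_total * sampling_fraction)]
print(check)
```

Because of the ceiling(), a non-empty group always contributes at least one row as long as the fraction is greater than 0.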

But what if I wanted to sample 10% from group 1 and 50% from group 2?

Recommended answer

You can use .GRP, the integer counter of the current group, to index a vector of fractions with one entry per group. To make sure each fraction is matched to the right group, you might want to define group_col as a factor variable.

group_sampler <- function(data, group_col, sample_fractions) {
  # this function samples a group-specific fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - name of the column used to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, giving the share of each group to sample
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
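A minimal usage sketch for the positional version, answering the "10% from group 1, 50% from group 2" case. The seed and the example data are assumptions for illustration. The fractions are given in the sorted order of the groups, which is also the order in which the groups first appear in this data, so the result does not depend on how .GRP numbers the groups.

```r
library(data.table)

group_sampler <- function(data, group_col, sample_fractions) {
  # one fraction per group, matched by group position (.GRP)
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[.GRP]))], keyby = group_col]
}

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
DT <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))

# 10% of group 1, 50% of group 2
sampled_dt <- group_sampler(DT, "a", c(0.1, 0.5))

# rows kept per group
counts <- sampled_dt[, .N, by = a]
print(counts)
```

With 10 rows per group this keeps ceiling(10 * 0.1) = 1 row from group 1 and ceiling(10 * 0.5) = 5 rows from group 2.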

Edit following chinsoon12's comment:

It would be safer (instead of relying on the groups being in the correct order) to make the last line of the function:

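data[, .SD[sample(.N, ceiling(.N*sample_fractions[[as.character(unlist(.BY))]]))], keyby = group_col]

The as.character() call makes the lookup go by name even when the group column is numeric, as it is in DT; without it, a numeric group value would index sample_fractions by position. You then pass sample_fractions as a named vector whose names match the group values:

group_sampler(DT, 'a', sample_fractions= c("1" = 0.1, "2" = 0.9))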
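The named-vector approach can be made a little more defensive by checking that the names cover exactly the groups present in the data. The sketch below is one way to do that, not part of the original answer: group_sampler_named is a hypothetical name, and the setequal() guard, the seed and the character-valued example data are assumptions for illustration.

```r
library(data.table)

# hypothetical variant of group_sampler() that looks fractions up by group name
group_sampler_named <- function(data, group_col, sample_fractions) {
  groups <- as.character(unique(data[[group_col]]))
  # every group needs exactly one named fraction
  stopifnot(!is.null(names(sample_fractions)),
            setequal(names(sample_fractions), groups))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(unlist(.BY))]]))],
       keyby = group_col]
}

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
DT <- data.table(a = rep(c("y", "x"), each = 10), b = sample(1:1000, 20))

# names deliberately listed in a different order from the sorted groups
sampled_dt <- group_sampler_named(DT, "a", c(y = 0.5, x = 0.1))

# rows kept per group
counts <- sampled_dt[, .N, by = a]
print(counts)
```

Since each fraction is found by name, the order of the entries in sample_fractions no longer matters, and a missing or misspelled group name fails early in stopifnot() rather than silently sampling the wrong share.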
