How do you sample groups in a data.table with a caveat

Views: 101 | Posted: 2017/3/12 11:09:46 | Tags: r, data.table

Question

The difference lies in a minor subtlety that I did not have enough reputation to discuss on that question itself.

Let's change Christopher Manning's initial data a little bit:

> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
     a   b
 1:  1 102
 2:  1   5
 3:  1 658
 4:  1 499
 5:  2 632
 6:  3 186
 7:  4 761
 8:  5 150
 9:  6 423
10:  7 832
11:  8 883
12:  9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19:  1 400
20:  1 467

If we try the solution from that question:

> DT[,.SD[sample(.N,3)],by = a]
 Error in sample.int(x, size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

This is because there are values in column a that occur only once. We cannot draw 3 samples for a value that occurs fewer than three times without using replacement (which we do not want to do).
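
The failure can be reproduced without data.table at all. The following is only a minimal sketch in base R: `group_rows` is a made-up name standing in for `.N` in a group that has a single row, and the error text is caught rather than printed because its wording may vary by locale.

```r
# A group with a single row, as for a = 2 in the table above.
# (group_rows is an illustrative name standing in for .N.)
group_rows <- 1L

# Asking for 3 draws without replacement from 1 row fails;
# tryCatch() captures the error message instead of stopping.
result <- tryCatch(
  sample(group_rows, 3),
  error = function(e) conditionMessage(e)
)
print(result)

# Capping the requested size at the group size avoids the error.
capped <- sample(group_rows, min(group_rows, 3))
print(capped)
```
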

I am struggling to deal with this scenario. We want to sample 3 times when the number of occurrences is >= 3, but take all occurrences when the number is < 3. For example, with our DT above we would want:

Maybe a solution could involve sorting the data.table like this, then using the lengths from rle() to find out which n to use in the sample function above:

> DT <- DT[order(DT$a),]
> DT
     a   b
 1:  1 102
 2:  1   5
 3:  1 658
 4:  1 499
 5:  1 400
 6:  1 467
 7:  2 632
 8:  3 186
 9:  4 761
10:  5 150
11:  6 423
12:  7 832
13:  8 883
14:  9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677

> ifelse(rle(DT$a)$lengths >= 3, 3,rle(DT$a)$lengths)
[1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1

If we replace "3" with n, this returns how many rows we should sample from a=1, a=2, a=3, and so on. I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!
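
As a side note on the rle() idea, the same per-group sizes can probably be computed more compactly with pmin(), which caps each run length elementwise. This is only a sketch in base R; `max_per_group` and `n_per_group` are illustrative names that do not appear in the original post.

```r
# Column a from the example data.
a <- c(1, 1, 1, 1:15, 1, 1)

# Illustrative cap on the number of rows to draw per group.
max_per_group <- 3

# Run lengths of the sorted values give the size of each group.
run_lengths <- rle(sort(a))$lengths

# pmin() caps every group size at max_per_group in one step,
# replacing the ifelse() call and the repeated rle() call.
n_per_group <- pmin(run_lengths, max_per_group)
print(n_per_group)
```
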

Recommended Answer

I might be misunderstanding your question, but are you looking for something like this?

library(data.table)
set.seed(123)
##
DT <- data.table(
  a=c(1,1,1,1:15,1,1), 
  b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42

where we are drawing 3 samples from b for group a_i if a_i contains three or more values; otherwise we draw only n values, where n (n < 3) is the size of group a_i.

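
A variant of the same idea, sketched here under a few assumptions, collects row numbers with `.I` and subsets the table once instead of subsetting `.SD` in every group; this is often suggested as being faster on large tables, although that is not measured here. The names `n_max`, `idx`, and `sampled` are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(123)
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

# Illustrative cap, matching the 3 used in the answer.
n_max <- 3

# .I holds the row numbers of the current group in DT.
# Pick min(.N, n_max) positions per group, without replacement.
idx <- DT[, .I[sample.int(.N, min(.N, n_max))], by = a]$V1

# Subset the original table once with the collected row numbers.
sampled <- DT[idx]

# Each group contributes min(group size, n_max) rows.
print(sampled[, .N, by = a])
```

With the example data this should yield 3 rows for a = 1 and 1 row for every other value of a, although which rows are drawn depends on the seed and the R version.
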
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):

R> DT[order(a)][1:6,]
   a   b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996

