How do you sample groups in a data.table with a caveat
Question
This question is very similar to an earlier question about sampling groups in a data.table.
The difference is in a minor subtlety that I did not have enough reputation to discuss for that question itself.
Let's change Christopher Manning's initial data a little bit:
> library(data.table)
> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 2 632
6: 3 186
7: 4 761
8: 5 150
9: 6 423
10: 7 832
11: 8 883
12: 9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19: 1 400
20: 1 467
If we tried the question's solution:
> DT[,.SD[sample(.N,3)],by = a]
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This is because some values in column a occur only once. We cannot draw 3 samples from a group with fewer than three rows without using replacement (which we do not want to do).
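To see the failure in isolation, here is a minimal base-R sketch. The variable names are illustrative and not from the original post. It shows that `sample()` rejects a size larger than the population, and that `replace = TRUE` avoids the error only by repeating the same row, which is not what we want here.

```r
# A group with a single row, as for a = 2, 3, ..., 15 above.
group_size <- 1L

# Asking for 3 draws without replacement raises an error; capture it.
result <- tryCatch(
  sample(group_size, 3),
  error = function(e) "error"
)

# With replacement the call succeeds, but the only row is drawn 3 times.
with_replacement <- sample(group_size, 3, replace = TRUE)
```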
I am struggling to deal with this scenario. We want to sample 3 times when the number of occurrences is >= 3, but pull the number of occurrences if it is < 3. For example with our DT above we would want:
a b
1: 1 102
2: 1 5
3: 1 658
4: 2 632
5: 3 186
6: 4 761
7: 5 150
8: 6 423
9: 7 832
10: 8 883
11: 9 247
12: 10 894
13: 11 141
14: 12 891
15: 13 488
16: 14 101
17: 15 677
Maybe a solution could involve sorting the data.table like this, then using rle() lengths to find out which n to use in the sample function above:
> DT <- DT[order(DT$a),]
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 1 400
6: 1 467
7: 2 632
8: 3 186
9: 4 761
10: 5 150
11: 6 423
12: 7 832
13: 8 883
14: 9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677
> ifelse(rle(DT$a)$lengths >= 3, 3, rle(DT$a)$lengths)
[1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If we replace "3" with n, this will return how much we should sample from a=1, a=2, a=3... I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!
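For what it's worth, the same per-group counts can be obtained without sorting or `rle()`, because data.table exposes the group size as `.N`. This is only a sketch under assumptions: the column name `n_to_sample` is made up for illustration, and `b` is filled with fixed values instead of a random sample so the result is reproducible.

```r
library(data.table)

# Same grouping column as in the question; b is fixed for reproducibility.
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = 1:20)

# .N is the number of rows in each group of a; pmin() caps it at 3.
sizes <- DT[, .(n_to_sample = pmin(.N, 3L)), by = a]
```

Here `sizes` has one row per value of a, with 3 for a = 1 and 1 for every other group, matching the `rle()` output above.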
Recommended answer
I might be misunderstanding your question, but are you looking for something like this?
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
where we are drawing 3 samples from b for group a_i if a_i contains three or more values, else we draw only n values, where n (n < 3) is the size of group a_i.
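As a variation, the same capped sampling can be written by sampling row numbers with `.I` instead of subsetting `.SD`. This is a commonly used data.table idiom that may be faster when there are many groups, but it is a sketch and not part of the answer above; `idx` and `res` are illustrative names.

```r
library(data.table)

set.seed(123)
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

# .I holds the row numbers of DT that belong to the current group.
# Draw at most 3 of them per group, never more than the group has.
idx <- DT[, .I[sample(.N, min(.N, 3))], by = a]$V1

# Use the collected row numbers to pull the sampled rows from DT.
res <- DT[idx]
```

The result should contain three rows for a = 1 and one row for each of the other groups, 17 rows in total.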
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):
R> DT[order(a)][1:6,]
a b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996