Flag N randomly selected rows by group in data.table


Question

In a data.table, I want to flag N randomly selected rows within each group (C1) in a new column C3. Several similar questions have already been asked on SO here, here and here, but based on their answers I still cannot figure out a solution for my task.

library(data.table)
set.seed(1)
dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"), 
                 C2 = c(2,1,3,1,2,3,4,5,4,5)) 

dt
    C1 C2
 1:  A  2
 2:  A  1
 3:  A  3
 4:  B  1
 5:  C  2
 6:  C  3
 7:  C  4
 8:  D  5
 9:  D  4
10:  D  5

Here are the row indices of two randomly selected rows per group C1 (this doesn't work well for group B):

dt[, sample(.I, min(.N, 2)), by = C1]$V1
[1]  1  3  3  7  5 10  9

NB: for B, only one row should be selected, because group B contains only one row.
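The odd result for group B most likely comes from how base R's sample() treats a single number: when x is a length-one numeric value of at least 1, sample(x, size) draws from 1:x, so sample(4, 1) can return any of 1 to 4 rather than row 4 itself. A minimal sketch of one way around this, sampling positions within the group and then indexing .I (the name idx is only illustrative):

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Sample positions 1..N inside each group, then map them to global row numbers.
# sample(.N, k) always draws from 1:.N, so a one-row group can only yield its own row.
idx <- dt[, .I[sample(.N, min(.N, 2))], by = C1]$V1
idx
```

With this form the single row of group B (row 4) is always the one returned for that group.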

Here is a solution that flags one randomly selected row in each group, although it often fails for group B:

dt[, C3 := .I == sample(.I, 1), by = C1]
dt
    C1 C2    C3
 1:  A  2 FALSE
 2:  A  1  TRUE
 3:  A  3 FALSE
 4:  B  1 FALSE
 5:  C  2  TRUE
 6:  C  3 FALSE
 7:  C  4 FALSE
 8:  D  5  TRUE
 9:  D  4 FALSE
10:  D  5 FALSE

What I actually want is to extend this to N rows. I've tried the following (for two rows):

dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]

That, of course, doesn't work.

Any help is much appreciated!

Recommended answer

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]
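This works because sample(.N, k) returns k distinct positions between 1 and the group size, and %in% tests membership in that set instead of comparing element by element the way == does. As a rough sanity check, the sketch below rebuilds the example data and counts the flagged rows per group; the names n and check are illustrative and not part of the original answer:

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

n <- 2  # number of rows to flag per group (illustrative name)
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, n)), by = C1]

# Each group should have min(group size, n) rows flagged TRUE
check <- dt[, .(rows = .N, flagged = sum(C3)), by = C1]
check
```

Groups A, C and D should each show two flagged rows, and group B one.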

Alternatively, use head, although I would expect that to be slower:

dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]

If the number of rows to flag differs from group to group, you can do:

flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
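Note that flagsz[.GRP] relies on the groups appearing in the same order as the entries of flagsz. One possible variation, offered only as a sketch, is to name the vector by group value and look it up through .BY, so the mapping no longer depends on row order; the named vector and the counts variable are assumptions made for illustration:

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Per-group flag counts, keyed by the group value instead of group position
flagsz <- c(A = 2, B = 1, C = 2, D = 3)

# .BY$C1 holds the current group's value of C1 inside j
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, flagsz[[.BY$C1]])), by = C1]

# Number of flagged rows per group, in order of first appearance
counts <- dt[, sum(C3), by = C1]$V1
counts
```

A group value missing from the named vector would raise an error in this sketch, which may or may not be the behaviour you want.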
