Randomly draw rows from dataframe based on unique values and column values

Views: 63 | Posted: 2020/10/15 20:42:57 | Tags: r, random, data.table, subset


Problem description

I have a dataframe with many descriptor variables (trt, individual, session). I want to randomly select a subset of the possible trt x individual combinations while controlling for the session variable, so that no two sampled rows share a session number. Here is what my dataframe looks like:
trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
   trt individual session             data
1    A        Bob       1 3.72013685581385
2    A      Nancy       1 3.97225419000673
3    A        Tim       1 4.44714175686225
4    B        Bob       2 5.00024599458127
5    B      Nancy       2 3.43615965145765
6    B        Tim       2  6.7920094635501
7    C        Bob       3 4.36315054477571
8    C      Nancy       3 5.07117348146375
9    C        Tim       3 4.38503325758969
10   A        Bob       4 4.30677162933005
11   A      Nancy       4 1.89311687510669
12   A        Tim       4 3.09084920968413
13   B        Bob       5 3.10436190897144
14   B      Nancy       5 3.59454992439722
15   B        Tim       5 3.40778069131207
16   C        Bob       6 4.00171937800892
17   C      Nancy       6 0.14578811080644
18   C        Tim       6 4.20754733296227
19   A        Bob       7 3.69131009783284
20   A      Nancy       7  4.7025756891679
21   A        Tim       7 4.46196017363017
22   B        Bob       8 3.97573281432736
23   B      Nancy       8  4.5373185942686
24   B        Tim       8 2.40937847038141
25   C        Bob       9 4.57519884980087
26   C      Nancy       9 5.19143914630448
27   C        Tim       9 4.83144732833874
28   A        Bob      10 3.01769965527235
29   A      Nancy      10 5.17300616827746
30   A        Tim      10 4.65432284571663
31   B        Bob      11 4.50892032922527
32   B      Nancy      11 3.38082717995663
33   B        Tim      11 4.92022245677209
34   C        Bob      12 4.54149796547394
35   C      Nancy      12 3.21992774137179
36   C        Tim      12 3.74507360931023
37   A        Bob      13 3.39524949548056
38   A      Nancy      13 4.17518916890901
39   A        Tim      13 3.02932375225388
40   B        Bob      14 3.59660910672907
41   B      Nancy      14 2.08784850191654
42   B        Tim      14 3.98446125755258
43   C        Bob      15 4.01837496797085
44   C      Nancy      15 3.40610126858125
45   C        Tim      15 4.57107635588582
46   A        Bob      16 3.15839276840723
47   A      Nancy      16 2.19932140340504
48   A        Tim      16 4.77588798035668
49   B        Bob      17  4.3524768657397
50   B      Nancy      17 4.49071625925856
51   B        Tim      17 4.02576463486266
52   C        Bob      18 3.74783360762117
53   C      Nancy      18 2.84123227236184
54   C        Tim      18  3.2024114782253
55   A        Bob      19 4.93837445490921
56   A      Nancy      19  4.7103051496802
57   A        Tim      19 6.22083635045134
58   B        Bob      20  4.5177747677824
59   B      Nancy      20 1.78839270771153
60   B        Tim      20 5.07140678136995
61   C        Bob      21 3.47818616035335
62   C      Nancy      21 4.28526474048439
63   C        Tim      21 4.22597602946575
64   A        Bob      22 1.91700925257901
65   A      Nancy      22 2.96317997587458
66   A        Tim      22 2.53506974227672
67   B        Bob      23 5.52714403395316
68   B      Nancy      23  3.3618513551059
69   B        Tim      23 4.85869007113978
70   C        Bob      24  3.4367068543959
71   C      Nancy      24 4.47769879000349
72   C        Tim      24 5.77340483757836
73   A        Bob      25 4.78524317734622
74   A      Nancy      25 3.55373702554664
75   A        Tim      25 2.88541465503637
76   B        Bob      26 4.62885302019139
77   B      Nancy      26 3.59430293369092
78   B        Tim      26 2.29610255924296
79   C        Bob      27 4.38433001299722
80   C      Nancy      27 3.77825207859976
81   C        Tim      27 2.12163194694365
How do I pull out 2 rows for each trt x individual combination so that every selected row has a unique session number? Here is an example of what I want the dataframe to look like:
       trt individual session             data
    1    A        Bob       1 3.72013685581385
    5    B      Nancy       2 3.43615965145765
    7    C        Bob       3 4.36315054477571
    12   A        Tim       4 3.09084920968413
    15   B        Tim       5 3.40778069131207
    17   C      Nancy       6 0.14578811080644
    19   A        Bob       7 3.69131009783284
    29   A      Nancy      10 5.17300616827746
    31   B        Bob      11 4.50892032922527
    34   C        Bob      12 4.54149796547394
    39   A        Tim      13 3.02932375225388
    40   B        Bob      14 3.59660910672907
    47   A      Nancy      16 2.19932140340504
    51   B        Tim      17 4.02576463486266
    54   C        Tim      18  3.2024114782253
    59   B      Nancy      20 1.78839270771153
    71   C      Nancy      24 4.47769879000349
    81   C        Tim      27 2.12163194694365
I have tried a couple of things with no luck.

I have tried to just randomly select two trt x individual combinations, but I end up with duplicate session values:
library(data.table)
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
    trt individual session             data
 1:   A        Bob      25  2.7560788894668
 2:   A        Bob      19 4.12040841647523
 3:   A      Nancy       4 5.35362338127901
 4:   A      Nancy      19 5.51636882737692
 5:   A        Tim      19 5.10553640201998
 6:   A        Tim       1 2.77380671625473
 7:   B        Bob      23 3.50585105164409
 8:   B        Bob       8 3.58167259470814
 9:   B      Nancy      23 2.85301307507985
10:   B      Nancy       8 2.85179395539781
11:   B        Tim      26 2.40666507132474
12:   B        Tim      20 3.31276311351286
13:   C        Bob      24 3.19076007024549
14:   C        Bob       3 3.59146613276121
15:   C      Nancy       9 4.46606667880457
16:   C      Nancy      15 2.25405252536256
17:   C        Tim      12 4.43111661206133
18:   C        Tim      27 4.23868848646589
I have tried randomly selecting one row for each session number and then pulling 2 rows per trt x individual combination, but it typically comes back with an error since the random selection doesn't grab an equal number of trt x individual combinations:
ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) : 
  unused argument (by = .(trt, individual))
Thanks in advance for your help!
Solution
Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meantime:
setDT(df)
setkey(df, session)

usedsessions = 0 # some value that's not a session number
df[, {
       res = .SD[!.(usedsessions)][sample(.N, 2)]
       usedsessions = c(usedsessions, res$session)
       res
     }
   , by = .(trt, individual)]
#    trt individual session     data
# 1:   A        Bob       7 4.256668
# 2:   A        Bob      25 2.431821
# 3:   A      Nancy      16 4.785859
# 4:   A      Nancy      19 4.865248
# 5:   A        Tim       4 3.303689
# 6:   A        Tim      13 3.550261
# 7:   B        Bob      26 3.987136
# 8:   B        Bob      17 3.283055
# 9:   B      Nancy      14 3.177226
#10:   B      Nancy       2 3.639542
#11:   B        Tim       8 2.168447
#12:   B        Tim       5 3.521123
#13:   C        Bob      21 3.284245
#14:   C        Bob      12 5.773098
#15:   C      Nancy      24 4.624428
#16:   C      Nancy       9 3.235467
#17:   C        Tim      18 4.001395
#18:   C        Tim      27 5.002110
You'll probably need to add corner case handling (e.g. when no such sampling exists, such as a group that has fewer than two unused sessions left).
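One way to make that corner case explicit is a greedy draw with restarts. The following is a minimal base-R sketch under stated assumptions, not the answerer's method: `sample_unique_sessions`, `n`, and `max_tries` are hypothetical names introduced here, and the seed is added only so the example is reproducible.

```r
# Rebuild the example data from the question (seed added for reproducibility).
set.seed(1)
df <- data.frame(
  trt        = rep(rep(c("A", "B", "C"), each = 3), 9),
  individual = rep(c("Bob", "Nancy", "Tim"), 27),
  session    = rep(1:27, each = 3),
  data       = rnorm(81, mean = 4, sd = 1)
)

# Hypothetical helper: draw n rows per trt x individual group so that no
# session number appears twice in the result. If a group runs out of unused
# sessions, the whole attempt is discarded and retried, up to max_tries times.
sample_unique_sessions <- function(df, n = 2, max_tries = 100) {
  # Row indices of each trt x individual group.
  groups <- split(seq_len(nrow(df)), list(df$trt, df$individual), drop = TRUE)
  for (attempt in seq_len(max_tries)) {
    used   <- integer(0)  # sessions taken so far in this attempt
    picked <- integer(0)  # row indices chosen so far in this attempt
    ok <- TRUE
    for (rows in groups[sample.int(length(groups))]) {  # random group order
      rows  <- rows[sample.int(length(rows))]           # shuffle rows in group
      s     <- df$session[rows]
      # Keep rows whose session is unused, one row per session.
      avail <- rows[!(s %in% used) & !duplicated(s)]
      if (length(avail) < n) {  # dead end: abandon this attempt
        ok <- FALSE
        break
      }
      take   <- avail[seq_len(n)]  # first n of the shuffled candidates
      picked <- c(picked, take)
      used   <- c(used, df$session[take])
    }
    if (ok) return(df[sort(picked), ])
  }
  stop("no valid sample found in ", max_tries, " attempts")
}

res <- sample_unique_sessions(df, n = 2)
nrow(res)                   # 18 rows: 9 groups x 2
anyDuplicated(res$session)  # 0 means every session number is unique
```

With the question's data every attempt succeeds, because each trt has nine sessions and needs only six. The restart loop matters for data where an unlucky early draw can leave a later group with too few unused sessions; if no attempt succeeds, the function stops with an error instead of returning a partial sample.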
