How to randomly draw from subsets of data and bootstrap a statistical test in R

Question

I have a dataset containing two variables, and I wish to test statistically whether they are related, inside a bootstrap loop (i.e. using Spearman's rank correlation with cor.test(...)).

Most of the measurements in my dataset come from independent sample units (let's call the units plants), although some measurements come from the same plant. To deal with pseudoreplication, I wish to run the statistical test a number of times, using only one measurement from each plant in each run. I therefore need to write a bootstrap loop that randomly draws one measurement for each plant before performing the correlation test, and then repeats this process 99 times.

I wish to end up with a csv file containing the p-value, rho and S statistic for each of the 99 tests.

Example data:

dput(df)

structure(list(Plant = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L, 
9L, 10L, 10L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 18L, 
19L, 20L, 21L), Length = c(170L, 232L, 123L, 190L, 112L, 207L, 
93L, 291L, 178L, 206L, 141L, 257L, 304L, 222L, 279L, 192L, 101L, 
253L, 176L, 278L, 311L, 129L, 191L, 205L, 226L), Count = c(7L, 
9L, 5L, 7L, 5L, 6L, 2L, 10L, 6L, 7L, 4L, 8L, 11L, 7L, 8L, 5L, 
5L, 9L, 7L, 6L, 9L, 4L, 5L, 7L, 6L)), .Names = c("Plant", "Length", 
"Count"), class = "data.frame", row.names = c(NA, -25L))


   Plant Length Count
1      1    170     7
2      2    232     9
3      3    123     5
4      4    190     7
5      5    112     5
6      6    207     6
7      6     93     2   
8      7    291    10  etc....

So far, I have put together the code below. It begins by randomly drawing a single row for each plant that is represented by multiple rows, and combines these rows with the rest of the data before running the stats test. However, I am now struggling to incorporate a bootstrapping function (i.e. boot() or bootstrap()) to run the stats test and repeat the cycle multiple times:

# 1. create dataframe without plants with >1 measurement/row (in this example plant 6,10 & 18 have multiple rows)
df_uniq = df[ ! df$Plant %in% c(6,10,18), ]

# 2. create data subsets for each plant with >1 measurement/row
dup1 = df[6:7,]
dup2 = df[11:13,] 
dup3 = df[21:22,]

# 3. randomly draw one row for each plant with multiple measurements
d1_draw = dup1[sample(nrow(dup1), 1), ]
d2_draw = dup2[sample(nrow(dup2), 1), ]
d3_draw = dup3[sample(nrow(dup3), 1), ]

# 4. merge df_uniq with randomly drawn rows for each plant with multiple measurements
df_merge = rbind(df_uniq, d1_draw, d2_draw, d3_draw)

# 5. Test whether the two variables (length & Count) are related and write results to file
cor_res <- cor.test(df_merge$Length, df_merge$Count, method= "spearman")
write.csv(matrix(c(cor_res$statistic, cor_res$p.value, cor_res$estimate)), row.names=c("statistic", "p.value", "rho"), "test_output.csv")

I am sure that there is a quick and elegant way to solve the problem. Any assistance would be greatly appreciated! Many thanks.

Recommended answer

Why extract the unique rows in the first place? If a plant has only one row, sampling one row from that plant simply returns that row, while plants with more than one row are still sampled at random. The same sampling step can therefore be applied to every plant.

So you could do this:

set.seed(123)
library(plyr)
ddply(df, .(Plant), function(x) { y <- x[sample(nrow(x), 1) ,] })

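# Output (the howmany column is not in the example data above; it appears to be the number of rows per plant):
#   Plant Length Count howmany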
#1      1    170     7       1
#2      2    232     9       1
#3      3    123     5       1
#4      4    190     7       1
#5      5    112     5       1
#6      6    207     6       2
#7      7    291    10       1
#8      8    178     6       1
#9      9    206     7       1
#10    10    257     8       3
#11    11    222     7       1
#12    12    279     8       1
#13    13    192     5       1
#14    14    101     5       1
#15    15    253     9       1
#16    16    176     7       1
#17    17    278     6       1
#18    18    311     9       2
#19    19    191     5       1
#20    20    205     7       1
#21    21    226     6       1
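
The same one-row-per-plant draw can also be done without plyr. The following is only a minimal base R sketch, not part of the original answer: draw_one_per_plant is a hypothetical helper name, and the data frame is rebuilt from the example data in the question.

```r
# Example data from the question
df <- data.frame(
  Plant  = c(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14,
             15, 16, 17, 18, 18, 19, 20, 21),
  Length = c(170, 232, 123, 190, 112, 207, 93, 291, 178, 206, 141, 257,
             304, 222, 279, 192, 101, 253, 176, 278, 311, 129, 191, 205, 226),
  Count  = c(7, 9, 5, 7, 5, 6, 2, 10, 6, 7, 4, 8, 11, 7, 8, 5, 5, 9, 7,
             6, 9, 4, 5, 7, 6)
)

# Hypothetical helper: pick one random row index per plant, then subset.
draw_one_per_plant <- function(d) {
  rows_by_plant <- split(seq_len(nrow(d)), d$Plant)
  idx <- vapply(rows_by_plant,
                function(i) i[sample.int(length(i), 1)],
                integer(1))
  d[idx, ]
}

set.seed(123)
one_per_plant <- draw_one_per_plant(df)
```

Indexing with i[sample.int(length(i), 1)] keeps the draw safe for plants with a single row, because it samples a position within the group rather than passing a row number directly to sample().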

And for your cor.test:

# first create your own function:
myrandomcors <- function(P) {
  # draw one random row per plant
  ss <- ddply(P, .(Plant), function(x) x[sample(nrow(x), 1), ])
  # Spearman correlation test on the subsample
  cor_res <- cor.test(ss$Length, ss$Count, method = "spearman")
  return(c(stat = cor_res$statistic, p = cor_res$p.value, est = cor_res$estimate))
}

# then repeat it 5 times...
answer <- do.call( rbind, replicate(5, myrandomcors(df), simplify=FALSE ) )

#    > answer
#       stat.S            p   est.rho
#[1,] 352.4557 4.275291e-05 0.7711327
#[2,] 461.2733 4.060286e-04 0.7004719
#[3,] 340.2024 3.159626e-05 0.7790893
#[4,] 368.3967 6.227648e-05 0.7607814
#[5,] 342.4391 3.341956e-05 0.7776369
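
To get the 99 runs and the csv file asked for in the question, the same idea can be repeated 99 times and the results written out. This is a hedged base R sketch rather than the answerer's code: one_run is a hypothetical function name, exact = FALSE is an added argument (it requests the asymptotic p-value and so avoids the warning cor.test gives for tied ranks), and the seed is arbitrary. Strictly speaking, this is repeated random subsampling of one row per plant rather than a bootstrap with replacement.

```r
# Example data from the question
df <- data.frame(
  Plant  = c(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14,
             15, 16, 17, 18, 18, 19, 20, 21),
  Length = c(170, 232, 123, 190, 112, 207, 93, 291, 178, 206, 141, 257,
             304, 222, 279, 192, 101, 253, 176, 278, 311, 129, 191, 205, 226),
  Count  = c(7, 9, 5, 7, 5, 6, 2, 10, 6, 7, 4, 8, 11, 7, 8, 5, 5, 9, 7,
             6, 9, 4, 5, 7, 6)
)

# Hypothetical helper: one subsample (one row per plant) plus one Spearman test.
one_run <- function(d) {
  idx <- vapply(split(seq_len(nrow(d)), d$Plant),
                function(i) i[sample.int(length(i), 1)],
                integer(1))
  ss <- d[idx, ]
  res <- cor.test(ss$Length, ss$Count, method = "spearman", exact = FALSE)
  c(S = unname(res$statistic),
    p.value = res$p.value,
    rho = unname(res$estimate))
}

set.seed(1)  # arbitrary seed, only for reproducibility
# replicate() returns a 3 x 99 matrix; transpose to one row per run
results <- as.data.frame(t(replicate(99, one_run(df))))

# One row per run, with columns S, p.value and rho
write.csv(results, "test_output.csv", row.names = FALSE)
```

Each row of test_output.csv then corresponds to one run, which makes it straightforward to summarise the spread of rho or the share of runs below a chosen significance level afterwards.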
