Calculate correlation coefficient by bootstrapping


Problem description

I'm looking at the correlation between the day of the year on which 5 species of bird started moulting their feathers and the number of days it took these 5 species to complete the moult.

I've tried to simulate my data in the code below. For each of the 5 species, I have the start day for 10 individuals and the duration for 10 individuals. For each species, I calculated the mean start day and the mean duration, then calculated the correlation across the 5 species.

What I want to do is bootstrap the mean start day and the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.

I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?

# speciesXX_start_day is the day of the year that 10 individuals of birds started moulting their feathers
# speciesXX_duration is the number of days that each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))

species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))

species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))

species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))

species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))

start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)

library(plyr)

# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)

# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)

# correlation between start date and duration
cor(starts_mean, durations_mean)

Recommended answer

R lets you resample a dataset with the sample function. To bootstrap, you take random samples (with replacement) from the original data and recalculate the statistic for each resample. You can store the intermediate results in a data structure so that you can process them afterwards.
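As a minimal sketch of the resampling step on its own, the following bootstraps the mean of a single vector. The vector `x`, the seed and the 1000 replicates are made up for illustration and are not taken from the question.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# hypothetical data: start days for 10 individuals of one species
x <- c(8, 9, 9, 10, 10, 10, 11, 11, 12, 13)

# one bootstrap resample: same length as x, drawn with replacement,
# so some values may appear several times and others not at all
resample <- sample(x, size = length(x), replace = TRUE)

# bootstrap distribution of the mean from 1000 resamples
boot_means <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

# 2.5%, 50% and 97.5% quantiles of the bootstrapped means
quantile(boot_means, c(0.025, 0.5, 0.975))
```

The same pattern (resample, recompute, collect) carries over to any statistic, including a correlation computed from several resampled means.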

A possible solution for your specific problem is given below. For each species we draw 10,000 resamples of size 3, calculate the statistics and save the results in a list or vector (a conventional bootstrap would resample at the original sample size, 10 here, rather than 3). After the bootstrap we can process all the data:

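nrSamples <- 10000
# pre-allocate containers with one slot per bootstrap replicate
listOfMeanStart <- vector(mode = "list", length = nrSamples)
listOfMeanDuration <- vector(mode = "list", length = nrSamples)
correlations <- vector(mode = "numeric", length = nrSamples)

for (i in seq_len(nrSamples))
{
  # resample each species with replacement; sapply returns a 3 x 5 matrix (one column per species)
  sampleStartDate <- sapply(start_dates, sample, size = 3, replace = TRUE)
  sampleDurations <- sapply(start_duration, sample, size = 3, replace = TRUE)

  # column means are the per-species means for this replicate
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}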

quantile(correlations,c(0.025,.5,0.975))
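For comparison, here is a hedged sketch of the same idea that resamples each species at its original sample size (10 individuals) and uses replicate instead of an explicit loop. The helper names `boot_mean` and `boot_cor`, the seed and the regenerated data are assumptions for illustration. Like the loop above, it resamples start days and durations independently; if both were measured on the same birds, resampling individuals jointly might be more appropriate.

```r
set.seed(42)  # arbitrary seed for reproducibility

# simulated data in the same shape as the question: 5 species, 10 individuals each
start_dates <- lapply(c(10, 20, 30, 40, 50), function(m) as.integer(rnorm(10, m, 2)))
durations   <- lapply(c(100, 101, 102, 103, 104), function(m) as.integer(rnorm(10, m, 2)))

# hypothetical helper: mean of one resample of x, drawn with replacement
# at the original sample size
boot_mean <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# hypothetical helper: one bootstrap replicate, i.e. resample every species,
# then correlate the 5 mean start days with the 5 mean durations
boot_cor <- function() {
  cor(sapply(start_dates, boot_mean), sapply(durations, boot_mean))
}

correlations <- replicate(10000, boot_cor())

# 2.5%, 50% and 97.5% quantiles of the bootstrapped correlation coefficients
quantile(correlations, c(0.025, 0.5, 0.975))
```

Keeping only the vector of correlations avoids storing the intermediate means when they are not needed afterwards.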
