多次拟合回归并收集摘要统计信息 [英] Fitting regression multiple times and gather summary statistics

查看:89
本文介绍了多次拟合回归并收集摘要统计信息的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个看起来像这样的数据框:

I have a dataframe that looks like this:

W01           0.750000     0.916667     0.642857      1.000000      0.619565   
W02           0.880000     0.944444     0.500000      0.991228      0.675439   
W03           0.729167     0.900000     0.444444      1.000000      0.611111   
W04           0.809524     0.869565     0.500000      1.000000      0.709091   
W05           0.625000     0.925926     0.653846      1.000000      0.589286   

Variation  1_941119_A/G  1_942335_C/G  1_942451_T/C  1_942934_G/C  \
W01            0.967391      0.965909             1      0.130435   
W02            0.929825      0.937500             1      0.184211   
W03            0.925926      0.880000             1      0.138889   
W04            0.918182      0.907407             1      0.200000   
W05            0.901786      0.858491             1      0.178571   

Variation  1_944296_G/A    ...     X_155545046_C/T  X_155774775_G/T  \
W01            0.978261    ...            0.652174         0.641304   
W02            0.938596    ...            0.728070         0.736842   
W03            0.944444    ...            0.675926         0.685185   
W04            0.927273    ...            0.800000         0.690909   
W05            0.901786    ...            0.794643         0.705357   

Variation  Y_5100327_G/T  Y_5100614_T/G  Y_12786160_G/A  Y_12914512_C/A  \
W01             0.807692       0.800000        0.730769        0.807692   
W02             0.655172       0.653846        0.551724        0.666667   
W03             0.880000       0.909091        0.833333        0.916667   
W04             0.666667       0.642857        0.580645        0.678571   
W05             0.730769       0.720000        0.692308        0.720000   

Variation  Y_13470103_G/A  Y_19705901_A/G  Y_20587967_A/C  mean_age  
W01              0.807692        0.666667        0.333333      56.3  
W02              0.678571        0.520000        0.250000      66.3  
W03              0.916667        0.764706        0.291667      69.7  
W04              0.666667        0.560000        0.322581      71.6  
W05              0.703704        0.600000        0.346154      72.5  

[5 rows x 67000 columns]

我想对每列拟合一个简单的最小二乘线性回归和Thiel-Sen线性回归作为自变量,将均值作为响应变量,并收集包括slopeintercept,<每次拟合都使用c2>,p valuestd err,最好将其输出收集为数据仓库!

I would like to fit a simple Least squares linear regression and Thiel-Sen linear regression for each column as an independent variable and mean-age as the response variable and gather summary statistics including the slope, intercept, r value, p value and std err for each fit and preferably gathers the outputs as a datafarme!

到目前为止,我一直在对'df'进行切片,并对每个列分别进行回归分析:

So far, I have been slicing my 'df' and carrying out regression analysis for each column separately:

from scipy import stats
import time

# Start timer
start_time = time.time()

# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]

# Use scipy lingress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
    tdf['1_69270_A/G'])

print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
  '. Slope = ' + str(slope)+'\n')

# Create regression line
regressLine = intercept + tdf['mean_age']*slope

# Regression using Theil-Sen with 95% confidence intervals 
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)

print('Thiel-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
  str(res[0]) +'\n')

# Scatter plot the temperature
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')

# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line'); 

# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')

# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')

# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Year'); plt.ylabel('Temperature (C)')
plt.savefig('pythonRegress.png')

# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')

我想知道如何在每个列的迭代循环中进行此分析,并将最终结果收集在一个综合数据框中.

I was wondering how I could carry out this analysis in an iterative loop for each column and gather the final results in a comprehensive dataframe.

我看过[this](循环回归并获得矩阵形式的摘要统计循环回归并以矩阵形式获取摘要统计 )!却不是我期望的输出.赞赏Python或R中的任何解决方案!

I have seen [this](Looping regression and obtaining summary statistics in matrix form"Looping regression and obtaining summary statistics in matrix form ")! but not quite the output I expect. Any solution in Python or R is appreciated!

推荐答案

我认为您会发现本指南很有用:

I think you will find this guide useful: Running a model on separate groups.

让我们生成一些与您的示例数据相似的示例数据,其中包含两个变量的值和平均年龄.我们还需要一些软件包:

Let's generate some example data similar to yours, with values for two variants and mean age. We also need a few packages:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE), 
                    snp01 = rnorm(50), 
                    snp02 = rnorm(50))

第一步是使用gather从宽"格式转换为长"格式,因此变体名称位于一列中,值位于另一列中.然后我们可以按变体名称nest.

The first step is to transform from "wide" to "long" format using gather, so as variant names are in one column and values in another. Then we can nest by variant name.

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp)

这会创建一个小标题(特殊数据框),其中第二列data是列表列"-它包含平均年龄和该行中变体的值:

This creates a tibble (a special data frame) where the second column, data is a "list column" - it contains mean ages and the values for the variant in that row:

# A tibble: 2 x 2
  snp   data             
  <chr> <list>           
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>

现在,我们使用purrr::map为每一行创建带有线性模型的第三列:

Now we use purrr::map to create a third column with the linear model for each row:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)))

结果:

# A tibble: 2 x 3
  snp   data              model 
  <chr> <list>            <list>
1 snp01 <tibble [50 x 2]> <lm>  
2 snp02 <tibble [50 x 2]> <lm>

最后一步是根据需要汇总模型,然后unnest数据结构.我正在使用broom::glance().完整程序:

The last step is to summarise the models as desired, then unnest the data structure. I'm using broom::glance(). The full procedure:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)), 
         summary = map(model, glance)) %>% 
  select(-data, -model) %>% 
  unnest(summary)

结果:

# A tibble: 2 x 12
  snp   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 snp01   0.00732      -0.0134   12.0     0.354   0.555     2  -194.  394.  400.    6901.          48
2 snp02   0.0108       -0.00981  12.0     0.524   0.473     2  -194.  394.  400.    6877.          48

这篇关于多次拟合回归并收集摘要统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆