多次拟合回归并收集摘要统计信息 [英] Fitting regression multiple times and gather summary statistics
问题描述
我有一个看起来像这样的数据框:
I have a dataframe that looks like this:
W01 0.750000 0.916667 0.642857 1.000000 0.619565
W02 0.880000 0.944444 0.500000 0.991228 0.675439
W03 0.729167 0.900000 0.444444 1.000000 0.611111
W04 0.809524 0.869565 0.500000 1.000000 0.709091
W05 0.625000 0.925926 0.653846 1.000000 0.589286
Variation 1_941119_A/G 1_942335_C/G 1_942451_T/C 1_942934_G/C \
W01 0.967391 0.965909 1 0.130435
W02 0.929825 0.937500 1 0.184211
W03 0.925926 0.880000 1 0.138889
W04 0.918182 0.907407 1 0.200000
W05 0.901786 0.858491 1 0.178571
Variation 1_944296_G/A ... X_155545046_C/T X_155774775_G/T \
W01 0.978261 ... 0.652174 0.641304
W02 0.938596 ... 0.728070 0.736842
W03 0.944444 ... 0.675926 0.685185
W04 0.927273 ... 0.800000 0.690909
W05 0.901786 ... 0.794643 0.705357
Variation Y_5100327_G/T Y_5100614_T/G Y_12786160_G/A Y_12914512_C/A \
W01 0.807692 0.800000 0.730769 0.807692
W02 0.655172 0.653846 0.551724 0.666667
W03 0.880000 0.909091 0.833333 0.916667
W04 0.666667 0.642857 0.580645 0.678571
W05 0.730769 0.720000 0.692308 0.720000
Variation Y_13470103_G/A Y_19705901_A/G Y_20587967_A/C mean_age
W01 0.807692 0.666667 0.333333 56.3
W02 0.678571 0.520000 0.250000 66.3
W03 0.916667 0.764706 0.291667 69.7
W04 0.666667 0.560000 0.322581 71.6
W05 0.703704 0.600000 0.346154 72.5
[5 rows x 67000 columns]
我想对每列拟合一个简单的最小二乘线性回归和Thiel-Sen线性回归作为自变量,将均值作为响应变量,并收集包括slope
,intercept
,<每次拟合都使用c2>,p value
和std err
,最好将其输出收集为数据仓库!
I would like to fit a simple Least squares linear regression and Thiel-Sen linear regression for each column as an independent variable and mean-age as the response variable and gather summary statistics including the slope
, intercept
, r value
, p value
and std err
for each fit and preferably gathers the outputs as a datafarme!
到目前为止,我一直在对'df'进行切片,并对每个列分别进行回归分析:
So far, I have been slicing my 'df' and carrying out regression analysis for each column separately:
from scipy import stats
import time
# Start timer
start_time = time.time()
# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]
# Use scipy lingress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
tdf['1_69270_A/G'])
print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
'. Slope = ' + str(slope)+'\n')
# Create regression line
regressLine = intercept + tdf['mean_age']*slope
# Regression using Theil-Sen with 95% confidence intervals
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)
print('Thiel-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
str(res[0]) +'\n')
# Scatter plot the temperature
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')
# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line');
# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')
# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')
# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Year'); plt.ylabel('Temperature (C)')
plt.savefig('pythonRegress.png')
# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')
我想知道如何在每个列的迭代循环中进行此分析,并将最终结果收集在一个综合数据框中.
I was wondering how I could carry out this analysis in an iterative loop for each column and gather the final results in a comprehensive dataframe.
我看过[this](循环回归并获得矩阵形式的摘要统计循环回归并以矩阵形式获取摘要统计 )!却不是我期望的输出.赞赏Python或R中的任何解决方案!
I have seen [this](Looping regression and obtaining summary statistics in matrix form"Looping regression and obtaining summary statistics in matrix form ")! but not quite the output I expect. Any solution in Python or R is appreciated!
推荐答案
I think you will find this guide useful: Running a model on separate groups.
让我们生成一些与您的示例数据相似的示例数据,其中包含两个变量的值和平均年龄.我们还需要一些软件包:
Let's generate some example data similar to yours, with values for two variants and mean age. We also need a few packages:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE),
snp01 = rnorm(50),
snp02 = rnorm(50))
第一步是使用gather
从宽"格式转换为长"格式,因此变体名称位于一列中,值位于另一列中.然后我们可以按变体名称nest
.
The first step is to transform from "wide" to "long" format using gather
, so as variant names are in one column and values in another. Then we can nest
by variant name.
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp)
这会创建一个小标题(特殊数据框),其中第二列data
是列表列"-它包含平均年龄和该行中变体的值:
This creates a tibble (a special data frame) where the second column, data
is a "list column" - it contains mean ages and the values for the variant in that row:
# A tibble: 2 x 2
snp data
<chr> <list>
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>
现在,我们使用purrr::map
为每一行创建带有线性模型的第三列:
Now we use purrr::map
to create a third column with the linear model for each row:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)))
结果:
# A tibble: 2 x 3
snp data model
<chr> <list> <list>
1 snp01 <tibble [50 x 2]> <lm>
2 snp02 <tibble [50 x 2]> <lm>
最后一步是根据需要汇总模型,然后unnest
数据结构.我正在使用broom::glance()
.完整程序:
The last step is to summarise the models as desired, then unnest
the data structure. I'm using broom::glance()
. The full procedure:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)),
summary = map(model, glance)) %>%
select(-data, -model) %>%
unnest(summary)
结果:
# A tibble: 2 x 12
snp r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 snp01 0.00732 -0.0134 12.0 0.354 0.555 2 -194. 394. 400. 6901. 48
2 snp02 0.0108 -0.00981 12.0 0.524 0.473 2 -194. 394. 400. 6877. 48
这篇关于多次拟合回归并收集摘要统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!