Calculate pvalue from pandas DataFrame
Question
I have a DataFrame stats with a MultiIndex, containing 8 samples (only two shown here) and 8 genes for each sample.
In [13]: stats
Out[13]:
              ARG/16S                                            \
                count          mean           std           min
sample gene
Arnhem IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03
       Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00
       Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03
       TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04
       TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05
       blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06
       ermB       4.0  4.111081e-05  7.894879e-06  3.288865e-05
       ermF       4.0  2.335210e-05  4.519758e-06  1.832037e-05
Basel  Aph3a      4.0  7.815592e-06  1.757242e-06  5.539389e-06
       IC        11.0  5.095161e-03  5.639278e-03  1.302205e-03
       Int1      12.0  1.333068e+01  1.872207e+01  4.988048e-02
       Sul1      11.0  1.618617e-02  1.988817e-02  2.970397e-03
I'm trying to calculate the p-value (Student's t-test) for each gene, comparing the samples with one another.
I've used scipy.stats.ttest_ind_from_stats, but I only managed to get the p-values for one gene at a time, and only between samples that are neighbors in the list:
import pandas as pd
import scipy.stats

# `stats` is the summary DataFrame shown above, not scipy.stats
Experiments = list(values1_16S['sample'].unique())
d = []  # one result row per compared pair
for exp in Experiments:
    # pair each sample with the next one, wrapping around at the end
    if Experiments.index(exp) < len(Experiments) - 1:
        second = Experiments[Experiments.index(exp) + 1]
    else:
        second = Experiments[0]
    tstat, pvalue = scipy.stats.ttest_ind_from_stats(
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'mean'],
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'std'],
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'count'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'mean'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'std'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'count'])
    d.append({'loc1': exp, 'loc2': second, 'pvalue': pvalue})
stats_Sul1 = pd.DataFrame(d)
stats_Sul1
How can I get the p-values between ALL pairs of samples? And is there a way to do this for all genes at once, without running the code separately for each gene?
Recommended answer
Let's suppose you have the same X genes for each of the Y samples. I tried my method with X=3 and Y=2, but I guess you can generalize. I started with:
df1 =
             count       mean        std       min
sample gene
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970
Note that the genes need to be in the same order for every sample.
First, reset the index with df_reindex = df1.reset_index() (I'm not sure whether what I'm doing is possible while keeping the MultiIndex):
df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970
I create a rolled DF (the same rows shifted by the number of genes) and join it to df_reindex:
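import numpy as np  # pd.np is no longer available in recent pandas versions

nb_genes = 3
# shift the rows by nb_genes so each row lines up with the same gene of the other sample
df_rolled = pd.DataFrame(np.roll(df_reindex, nb_genes, 0), columns=df_reindex.columns)
# rsuffix='_' renames the clashing columns so the join can be performed
df_joined = df_reindex.join(df_rolled, rsuffix='_')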
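To see what the roll does, here is a minimal sketch using only the sample and gene columns of df_reindex (the variable names rolled and paired are mine, for illustration). Shifting by nb_genes rows puts the other sample's row for the same gene on each line, which is why the genes must be in the same order. Note that with more than two samples, a single roll only pairs each sample with one other sample.

```python
import numpy as np
import pandas as pd

# Only the identifying columns of df_reindex, to keep the example short.
df_reindex = pd.DataFrame({
    "sample": ["Arnhem"] * 3 + ["Basel"] * 3,
    "gene": ["IC", "Int1", "Sul1"] * 2,
})

nb_genes = 3
# Shift all rows down by nb_genes positions, wrapping around at the end.
rolled = pd.DataFrame(
    np.roll(df_reindex.to_numpy(), nb_genes, axis=0),
    columns=df_reindex.columns,
)
# Columns coming from the rolled frame get a trailing underscore.
paired = df_reindex.join(rolled, rsuffix="_")
print(paired)
```

Each row of paired now holds one gene with the two samples side by side (sample and sample_).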
Now, on the same row, I have all the data needed to calculate the pvalue, and I create the column with apply (here stats refers to scipy.stats, e.g. from scipy import stats, not to the DataFrame named stats in the question):
df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)
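As a quick sanity check of what one row of that apply computes, here is a sketch that calls ttest_ind_from_stats directly with the rounded Sul1 values from df1. The second call uses equal_var=False (Welch's t-test); that is my addition, not part of the original answer, and may be worth considering because the two standard deviations differ a lot.

```python
from scipy import stats

# Sul1, Arnhem vs Basel: mean, std, count taken from df1 above.
t_stat, p_pooled = stats.ttest_ind_from_stats(
    0.003969, 0.000919, 11,
    0.016186, 0.019888, 11,
)  # default equal_var=True: Student's t-test with pooled variance

# Same data without assuming equal variances (Welch's t-test).
_, p_welch = stats.ttest_ind_from_stats(
    0.003969, 0.000919, 11,
    0.016186, 0.019888, 11,
    equal_var=False,
)

print(p_pooled, p_welch)
```

The pooled p-value should be close to the 0.055298 reported for Sul1 in the answer's output table; small differences come from using the rounded inputs.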
Finally, I create a DF with the data you want and rename the columns:
df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})
You end up with this data:
df_output =
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298
If you want to compare every sample against every other one, I think a for loop could do it.
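As an alternative to writing that loop by hand, one option is a self-merge on gene, which pairs every sample with every other sample that has the same gene. This is a sketch under my own assumptions, not the original author's method: the names df_long, pairs and df_all_pairs are made up for illustration, and the numbers are copied from the tables in this post (including Basel's Aph3a row, to show a gene that exists in only one sample).

```python
import pandas as pd
from scipy import stats

# Long-format summary statistics, one row per (sample, gene).
df_long = pd.DataFrame(
    [
        ("Arnhem", "IC", 11, 0.002319, 0.000740),
        ("Arnhem", "Int1", 11, 7.243040, 6.848327),
        ("Arnhem", "Sul1", 11, 0.003969, 0.000919),
        ("Basel", "Aph3a", 4, 7.815592e-06, 1.757242e-06),
        ("Basel", "IC", 11, 0.005095, 0.005639),
        ("Basel", "Int1", 12, 13.330680, 18.722070),
        ("Basel", "Sul1", 11, 0.016186, 0.019888),
    ],
    columns=["sample", "gene", "count", "mean", "std"],
)

# Self-merge on gene: each row meets every row that has the same gene.
pairs = df_long.merge(df_long, on="gene", suffixes=("", "_"))
# Keep each unordered pair once and drop self-comparisons.
pairs = pairs[pairs["sample"] < pairs["sample_"]].reset_index(drop=True)

# ttest_ind_from_stats accepts arrays, so no row-wise apply is needed.
_, pvalues = stats.ttest_ind_from_stats(
    pairs["mean"].to_numpy(), pairs["std"].to_numpy(), pairs["count"].to_numpy(),
    pairs["mean_"].to_numpy(), pairs["std_"].to_numpy(), pairs["count_"].to_numpy(),
)
pairs["pvalue"] = pvalues

df_all_pairs = pairs[["sample", "sample_", "gene", "pvalue"]].rename(
    columns={"sample": "loc1", "sample_": "loc2"}
)
print(df_all_pairs)
```

Genes present in only one sample (Aph3a here) simply produce no row, and the gene order no longer matters because the merge matches on the gene name.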
EDIT: Using pivot_table, I think there is an easier way.
I take your input stats as a MultiIndex table restricted to ARG/16S (I'm not sure how to handle this column level), so I start with the following, which might be your stats['ARG/16S']:
df =
               count       mean           std       min
sample gene
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970
With the function pivot_table, you can rearrange your data like this:
df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene',
columns = 'sample', fill_value = 0)
In this df_pivot (I don't print it here for readability; it is shown at the end, with the new column), you can create a column for each pair (sample1, sample2) using itertools and apply:
import itertools
# itertools.combinations creates all pairs of samples
for sample1, sample2 in itertools.combinations(df.index.levels[0], 2):
    df_pivot[sample1 + '_' + sample2] = df_pivot.apply(
        lambda x: stats.ttest_ind_from_stats(
            x['mean'][sample1], x['std'][sample1], x['count'][sample1],
            x['mean'][sample2], x['std'][sample2], x['count'][sample2])[1],
        axis=1).fillna(1)  # genes missing in one sample give NaN, replaced by 1
I think this method is independent of the number of samples and genes, and it still works if the samples do not all have the same genes. You end up with a df_pivot like:
        count            mean                     std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel
gene
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  4.519758e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000
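In the table above, a gene that is missing from one sample ends up with a p-value of 1, which could be misread as "no difference". A possible variation, sketched here under my own assumptions (the names long_df, pivot and pvalues are illustrative, and only a subset of the rows is used), is to pivot without fill_value, test only the genes present in both samples, and leave the rest as NaN. It also passes whole columns to ttest_ind_from_stats instead of using apply.

```python
import itertools

import numpy as np
import pandas as pd
from scipy import stats

# A few rows copied from df above, with sample and gene as ordinary columns.
long_df = pd.DataFrame(
    [
        ("Arnhem", "IC", 11, 0.002319, 7.396130e-04),
        ("Arnhem", "Int1", 11, 7.243040, 6.848327),
        ("Arnhem", "Sul1", 11, 0.003969, 9.186019e-04),
        ("Basel", "Aph3a", 4, 7.815592e-06, 1.757242e-06),
        ("Basel", "IC", 11, 0.005095, 5.639278e-03),
        ("Basel", "Int1", 12, 13.330680, 18.72207),
        ("Basel", "Sul1", 11, 0.016186, 1.988817e-02),
    ],
    columns=["sample", "gene", "count", "mean", "std"],
)

# No fill_value: a gene missing from a sample stays NaN instead of 0.
pivot = long_df.pivot_table(
    values=["count", "mean", "std"], index="gene", columns="sample"
)

pvalues = pd.DataFrame(index=pivot.index)
for s1, s2 in itertools.combinations(pivot["mean"].columns, 2):
    # Only test genes that were measured in both samples.
    both = pivot["count"][[s1, s2]].notna().all(axis=1)
    sub = pivot.loc[both]
    _, p = stats.ttest_ind_from_stats(
        sub["mean"][s1].to_numpy(), sub["std"][s1].to_numpy(), sub["count"][s1].to_numpy(),
        sub["mean"][s2].to_numpy(), sub["std"][s2].to_numpy(), sub["count"][s2].to_numpy(),
    )
    # Index alignment leaves NaN for genes that were not tested.
    pvalues[f"{s1}_{s2}"] = pd.Series(p, index=sub.index)

print(pvalues)
```

Whether NaN or 1 is the better placeholder depends on what you do with the table afterwards; NaN just keeps "not tested" distinguishable from "tested, no evidence of a difference".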
Let me know if it works.
EDIT2: To reply to the comment, I think you can do this:
df_pivot stays unchanged; then you create a MultiIndex DF, df_multi, to write your results into:
df_multi = pd.DataFrame(index = df.index.levels[1],
columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)
Then you use a for loop to fill df_multi with the data:
# itertools.combinations creates all pairs of samples
for sample1, sample2 in itertools.combinations(df.index.levels[0], 2):
    df_multi.loc[:, (sample1, sample2)] = df_pivot.apply(
        lambda x: stats.ttest_ind_from_stats(
            x['mean'][sample1], x['std'][sample1], x['count'][sample1],
            x['mean'][sample2], x['std'][sample2], x['count'][sample2])[1],
        axis=1).fillna(1)  # genes missing in one sample give NaN, replaced by 1
Finally, you can use transpose and unstack on level 1 to get the layout you asked for (or something close to it, if I misunderstood):
df_output = df_multi.transpose().unstack(level=[1]).fillna(1)
You will see that the last sample is missing from the index and the first sample is missing from the columns (they don't exist because of the way I built everything). If you want them, you need to replace itertools.combinations by itertools.combinations_with_replacement, both in the creation of df_multi and in the for loop (I didn't try it, but it should work).
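To make the difference between the two itertools functions concrete, here is a small sketch using the two sample names from this post:

```python
import itertools

samples = ["Arnhem", "Basel"]

# Every unordered pair of distinct samples.
pairs = list(itertools.combinations(samples, 2))
# Same, plus each sample paired with itself.
pairs_with_self = list(itertools.combinations_with_replacement(samples, 2))

print(pairs)            # [('Arnhem', 'Basel')]
print(pairs_with_self)  # [('Arnhem', 'Arnhem'), ('Arnhem', 'Basel'), ('Basel', 'Basel')]
```

The extra self-pairs are what add the missing first and last samples to the columns and index; the values in those cells compare a sample with itself, so they carry no information.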