Calculate p-value from pandas DataFrame

Question

I have a DataFrame stats with a MultiIndex, containing 8 samples (only two shown here) and 8 genes for each sample.

 In[13]:stats
    Out[13]: 
                       ARG/16S                                            \
                         count          mean           std           min   
    sample      gene                                                       
    Arnhem      IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03   
                Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00   
                Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03   
                TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04   
                TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05   
                blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06   
                ermB       4.0  4.111081e-05  7.894879e-06  3.288865e-05   
                ermF       4.0  2.335210e-05  4.519758e-06  1.832037e-05   
    Basel       Aph3a      4.0  7.815592e-06  1.757242e-06  5.539389e-06   
                IC        11.0  5.095161e-03  5.639278e-03  1.302205e-03   
                Int1      12.0  1.333068e+01  1.872207e+01  4.988048e-02   
                Sul1      11.0  1.618617e-02  1.988817e-02  2.970397e-03   

I'm trying to calculate the p-value (Student's t-test) for each gene, comparing that gene between every pair of samples.

I've used scipy.stats.ttest_ind_from_stats, but I only managed to get the p-values between samples for one gene, and only for samples that are next to each other in the list.

import pandas as pd
import scipy.stats

d = []  # one result row per pair of neighbouring samples
Experiments = list(values1_16S['sample'].unique())
for exp in Experiments:
    # pair each sample with the next one in the list, wrapping around at the end
    if Experiments.index(exp)<len(Experiments)-1:
        second = Experiments[Experiments.index(exp)+1]
    else:
        second = Experiments[0]
    tstat, pvalue = scipy.stats.ttest_ind_from_stats(stats.loc[(exp,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','count'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','count'])
    d.append({'loc1':exp, 'loc2':second, 'pvalue':pvalue})


stats_Sul1 = pd.DataFrame(d)
stats_Sul1

How can I get the p-values between ALL samples? And is there a way to do this for all genes at once, without running the code separately for each gene?

Recommended answer

Let's suppose you have the same X genes for each of the Y samples. I tried my method with X=3 and Y=2, but I guess you can generalize. I started with:

df1 = 
             count       mean        std       min
sample gene                                       
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970

Note that the genes need to be in the same order within each sample. First call reset_index() with df_reindex = df1.reset_index(), because I'm not sure what I do next is possible with a MultiIndex:

df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970

I create a rolled copy of the DataFrame and join it to df_reindex:

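import numpy as np

nb_genes = 3
# shift the rows by nb_genes (wrapping around) so each row lines up with the same gene in the other sample
# np.roll is called through numpy directly; the pd.np alias is not available in recent pandas versions
df_rolled = pd.DataFrame(np.roll(df_reindex.to_numpy(), nb_genes, 0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join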

Now each row holds all the data needed to calculate the p-value, so I create the column with apply:

from scipy import stats  # here stats is the scipy module, not the DataFrame from the question
df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)

Finally, I create a DataFrame with the data you want and rename the columns:

df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})

You end up with this data:

df_output = 
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298
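
As a sanity check on the IC row of this output, the equal-variance Student's t-test can be recomputed by hand from the summary statistics and compared with scipy.stats.ttest_ind_from_stats. This is only a sketch: the numbers are copied from the question's table, and the variable names (pooled_var, std_err and so on) are mine, not part of the original answer.

```python
import math
from scipy import stats

# Summary statistics for gene IC, copied from the question's table.
mean1, std1, n1 = 2.319050e-03, 7.396130e-04, 11   # Arnhem
mean2, std2, n2 = 5.095161e-03, 5.639278e-03, 11   # Basel

# Pooled variance and standard error of the difference in means
# (Student's t-test, i.e. equal variances assumed).
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / dof
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = (mean1 - mean2) / std_err
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

# The same test computed by scipy from the summary statistics.
t_scipy, p_scipy = stats.ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2)
print(p_manual, p_scipy)
```

Both values should agree with each other and be close to the 0.121142 shown for IC above.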

If you want to compare every sample against every other sample, I think a for loop could do it.
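
One possible shape for such a loop is sketched below. It is not the answer's code: instead of rolling, it merges each pair of samples on the gene name, so gene order no longer matters. The third sample, SampleC, and its numbers are made up purely so that there is more than one pair to iterate over.

```python
import itertools
import pandas as pd
from scipy import stats

# Arnhem and Basel values come from the answer's df1;
# SampleC is a hypothetical third sample with invented numbers.
df_reindex = pd.DataFrame({
    'sample': ['Arnhem'] * 2 + ['Basel'] * 2 + ['SampleC'] * 2,
    'gene': ['IC', 'Sul1'] * 3,
    'count': [11, 11, 11, 11, 10, 10],
    'mean': [0.002319, 0.003969, 0.005095, 0.016186, 0.004000, 0.010000],
    'std': [0.000740, 0.000919, 0.005639, 0.019888, 0.002000, 0.008000],
})

rows = []
for s1, s2 in itertools.combinations(df_reindex['sample'].unique(), 2):
    left = df_reindex[df_reindex['sample'] == s1]
    right = df_reindex[df_reindex['sample'] == s2]
    # Inner merge on gene: keeps only genes present in both samples.
    pair = left.merge(right, on='gene', suffixes=('', '_'))
    # ttest_ind_from_stats accepts arrays, so one call covers all genes of the pair.
    _, pvalues = stats.ttest_ind_from_stats(
        pair['mean'].to_numpy(), pair['std'].to_numpy(), pair['count'].to_numpy(),
        pair['mean_'].to_numpy(), pair['std_'].to_numpy(), pair['count_'].to_numpy())
    rows.append(pd.DataFrame({'loc1': s1, 'loc2': s2,
                              'gene': pair['gene'].to_numpy(), 'pvalue': pvalues}))

df_all_pairs = pd.concat(rows, ignore_index=True)
print(df_all_pairs)
```

With three samples and two genes this gives six rows, one per (pair of samples, gene).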

EDIT: Using pivot_table, I think there is an easier way.

I take your input stats as a MultiIndex table for ARG/16S only (I'm not sure how to handle that column level), so I start with the following, which might be your stats['ARG/16S']:

df=
               count       mean           std       min
sample gene                                            
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970
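
For the ARG/16S column level, a small sketch of how it could be selected or dropped is given below. The frame stats_df is a hypothetical stand-in for the question's stats (renamed so that it does not clash with scipy's stats module) and holds only four rows copied from the table.

```python
import pandas as pd

# Hypothetical miniature of the question's `stats`: two column levels,
# with 'ARG/16S' as the only label on the top level.
columns = pd.MultiIndex.from_product([['ARG/16S'], ['count', 'mean', 'std']])
index = pd.MultiIndex.from_tuples(
    [('Arnhem', 'IC'), ('Arnhem', 'Sul1'), ('Basel', 'IC'), ('Basel', 'Sul1')],
    names=['sample', 'gene'])
stats_df = pd.DataFrame(
    [[11, 0.002319, 0.000740], [11, 0.003969, 0.000919],
     [11, 0.005095, 0.005639], [11, 0.016186, 0.019888]],
    index=index, columns=columns)

# Selecting the top-level label returns a frame with single-level columns.
df = stats_df['ARG/16S']

# Dropping the top level does the same when it holds only one label.
df_alt = stats_df.droplevel(0, axis=1)

print(df)
```

Either form leaves count, mean and std as plain columns, which is the layout used as df here.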

With the pivot_table function, you can rearrange your data like this:

df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene', 
                               columns = 'sample', fill_value = 0)

In this df_pivot (I don't print it here for readability, but it is shown at the end with the new column), you can create a column for each pair (sample1, sample2) using itertools and apply:

import itertools
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations creates every pair of samples
    df_pivot[sample1+ '_' + sample2 ] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

I think this method is independent of the number of samples and genes, and it works even if the samples do not all have the same genes. You end up with df_pivot like:

        count            mean                      std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel             
gene                                                                          
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  7.894879e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000
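
A variant of the same idea is sketched below, under my own assumptions rather than as the answer's method. It uses unstack instead of pivot_table, calls ttest_ind_from_stats once per pair on whole columns instead of through apply, and leaves NaN (rather than 1) where a gene is missing from one of the two samples. The data are a subset of the rows shown above.

```python
import itertools
import pandas as pd
from scipy import stats

# Subset of the answer's df: TetB exists only in Arnhem, Aph3a only in Basel.
df = pd.DataFrame(
    {'count': [11, 11, 11, 2, 4, 11, 12, 11],
     'mean': [2.319050e-03, 7.243040e+00, 3.968956e-03, 1.154748e-01,
              7.815592e-06, 5.095161e-03, 1.333068e+01, 1.618617e-02],
     'std': [7.396130e-04, 6.848327e+00, 9.186019e-04, 1.627663e-01,
             1.757242e-06, 5.639278e-03, 1.872207e+01, 1.988817e-02]},
    index=pd.MultiIndex.from_tuples(
        [('Arnhem', 'IC'), ('Arnhem', 'Int1'), ('Arnhem', 'Sul1'), ('Arnhem', 'TetB'),
         ('Basel', 'Aph3a'), ('Basel', 'IC'), ('Basel', 'Int1'), ('Basel', 'Sul1')],
        names=['sample', 'gene']))

# One row per gene, columns (statistic, sample); NaN where a gene is missing.
wide = df.unstack('sample')

pvalues = pd.DataFrame(index=wide.index)
for s1, s2 in itertools.combinations(wide.columns.levels[1], 2):
    # Keep only the genes measured in both samples of this pair.
    both = wide['count'][[s1, s2]].notna().all(axis=1)
    sub = wide[both]
    _, p = stats.ttest_ind_from_stats(
        sub[('mean', s1)].to_numpy(), sub[('std', s1)].to_numpy(), sub[('count', s1)].to_numpy(),
        sub[('mean', s2)].to_numpy(), sub[('std', s2)].to_numpy(), sub[('count', s2)].to_numpy())
    # Assigning a Series aligns on gene; genes left out of `sub` stay NaN.
    pvalues[s1 + '_' + s2] = pd.Series(p, index=sub.index)

print(pvalues)
```

Whether a missing gene should be reported as NaN or as 1 is a design choice; NaN makes it visible that no test was run for that gene.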

Let me know if it works.

EDIT2: To reply to the comment, I think you can do this:

df_pivot stays unchanged. You then create a DataFrame with MultiIndex columns, df_multi, to write your results in:

df_multi = pd.DataFrame(index = df.index.levels[1], 
                        columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)

Then you use a for loop to fill in df_multi:

for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations creates every pair of samples
    df_multi.loc[:,(sample1,sample2)] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

Finally, you can use transpose and unstack on level 1 to get the layout you asked for (or something close to it, if I misunderstood):

df_output = df_multi.transpose().unstack(level=[1]).fillna(1)

You will see that the last sample is missing from the index and the first sample is missing from the columns (because of the way I built everything, those pairs do not exist). If you want them, replace itertools.combinations with itertools.combinations_with_replacement both when creating df_multi and in the for loop (I didn't try it, but it should work).
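
To see why those entries are missing, it may help to compare what the two itertools functions produce. In this small sketch the sample list is illustrative only, and SampleC is a hypothetical third sample.

```python
import itertools

samples = ['Arnhem', 'Basel', 'SampleC']  # SampleC is hypothetical

# Each unordered pair once, never a sample with itself.
pairs = list(itertools.combinations(samples, 2))

# The same pairs plus each sample paired with itself.
pairs_with_self = list(itertools.combinations_with_replacement(samples, 2))

print(pairs)
print(pairs_with_self)

# With combinations, the last sample never appears first in a pair
# and the first sample never appears second.
first_members = {a for a, _ in pairs}
second_members = {b for _, b in pairs}
print(first_members, second_members)
```

With combinations_with_replacement every sample appears in both positions, at the cost of adding self-comparisons, which carry no information.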
