ANOVA for groups within a dataframe using scipy


Problem description

I have a DataFrame like the one below, and I need to run an ANOVA on it across three conditions:

import pandas as pd

df = pd.DataFrame({'Names': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007', 'AC007'],
                   'value': [22, 22, 2, 2, 2, 5],
                   'condition': ['NON', 'NON', 'YES', 'YES', 'RE', 'RE']})

For each name, I need to run an ANOVA between YES and NON, NON and RE, and YES and RE. I know I could select the samples by hand like this:

# Select the values for one name and one condition at a time
NON = df.query('condition == "NON" and Names == "CTA15"')
no = NON.value
YES = df.query('condition == "YES" and Names == "CTA15"')
Y = YES.value

Then perform a one-way ANOVA as follows:

    from scipy import stats
    f_val, p_val = stats.f_oneway(no, Y)
    print("One-way ANOVA P =", p_val)

But it would be great if there were a more elegant solution, because my real DataFrame is big and has many names and conditions to compare.

Solution

Consider the following sample DataFrame:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Names': np.random.randint(1, 10, 1000),
                   'value': np.random.randn(1000),
                   'condition': np.random.choice(['NON', 'YES', 'RE'], 1000)})

df.head()
Out: 
   Names condition     value
0      4        RE  0.844120
1      4       NON -0.440285
2      5       YES  0.559497
3      4        RE  0.472425
4      9       YES  0.205906

The following groups the DataFrame by Names, then passes the values of each condition within that name to a one-way ANOVA:

import scipy.stats as ss

for name, name_df in df.groupby('Names'):
    # One Series of values per condition within this name
    samples = [values for _, values in name_df.groupby('condition')['value']]
    # One-way ANOVA across all conditions of this name
    f_val, p_val = ss.f_oneway(*samples)
    print('Name: {}, F value: {:.3f}, p value: {:.3f}'.format(name, f_val, p_val))

Name: 1, F value: 0.138, p value: 0.871
Name: 2, F value: 1.458, p value: 0.237
Name: 3, F value: 0.742, p value: 0.479
Name: 4, F value: 2.718, p value: 0.071
Name: 5, F value: 0.255, p value: 0.776
Name: 6, F value: 1.731, p value: 0.182
Name: 7, F value: 0.269, p value: 0.764
Name: 8, F value: 0.474, p value: 0.624
Name: 9, F value: 1.226, p value: 0.297
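
The loop above prints one line per name. On a large frame it may be handier to collect the results in a table that can be sorted or filtered. The following is only a sketch under assumptions: the helper name `anova_by_name`, the `n_conditions` column, and the choice to record NaN for names with fewer than two conditions are illustrative and not part of the original answer. The seeded generator is there only to make the sample data reproducible.

```python
import numpy as np
import pandas as pd
from scipy import stats


def anova_by_name(df, group_col="Names", cond_col="condition", value_col="value"):
    """Run a one-way ANOVA across conditions, separately for each name.

    Hypothetical helper: returns one row per name with the F statistic and p value.
    """
    rows = []
    for name, name_df in df.groupby(group_col):
        # One array of values per condition present for this name
        samples = [vals.to_numpy() for _, vals in name_df.groupby(cond_col)[value_col]]
        if len(samples) < 2:
            # Assumption: a name with a single condition cannot be tested, so record NaN
            f_val, p_val = np.nan, np.nan
        else:
            f_val, p_val = stats.f_oneway(*samples)
        rows.append({group_col: name, "n_conditions": len(samples), "F": f_val, "p": p_val})
    return pd.DataFrame(rows, columns=[group_col, "n_conditions", "F", "p"])


# Sample data shaped like the DataFrame in the answer
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Names": rng.integers(1, 10, size=1000),
    "value": rng.standard_normal(1000),
    "condition": rng.choice(["NON", "YES", "RE"], size=1000),
})

result = anova_by_name(df)
print(result.sort_values("p"))
```

Applied to the small frame from the question, this would return NaN for CTA15, which only has the NON condition, rather than raising an error.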

For post-hoc tests, you can use Tukey's HSD from statsmodels:

from statsmodels.stats.multicomp import pairwise_tukeyhsd
for name, grouped_df in df.groupby('Names'):
    print('Name {}'.format(name), pairwise_tukeyhsd(grouped_df['value'], grouped_df['condition']))

Name 1 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0086  -0.5129 0.5301 False 
 NON    YES    0.0084  -0.4817 0.4986 False 
  RE    YES   -0.0002  -0.5217 0.5214 False 
--------------------------------------------
Name 2 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.0089  -0.5299 0.5121 False 
 NON    YES    0.083   -0.4182 0.5842 False 
  RE    YES    0.0919  -0.4008 0.5846 False 
--------------------------------------------
Name 3 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.2401  -0.3136 0.7938 False 
 NON    YES    0.2765  -0.2903 0.8432 False 
  RE    YES    0.0364  -0.5052 0.578  False 
--------------------------------------------
Name 4 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0894  -0.5825 0.7613 False 
 NON    YES   -0.0437  -0.7418 0.6544 False 
  RE    YES   -0.1331  -0.6949 0.4287 False 
--------------------------------------------
Name 5 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.4264  -0.9495 0.0967 False 
 NON    YES    0.0439  -0.4264 0.5142 False 
  RE    YES    0.4703  -0.0155 0.9561 False 
--------------------------------------------
Name 6 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0649  -0.4971 0.627  False 
 NON    YES    -0.406  -0.9405 0.1285 False 
  RE    YES   -0.4709  -1.0136 0.0717 False 
--------------------------------------------
Name 7 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.3111  -0.2766 0.8988 False 
 NON    YES   -0.1664  -0.7314 0.3987 False 
  RE    YES   -0.4774  -1.0688 0.114  False 
--------------------------------------------
Name 8 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.0224   -0.668 0.6233 False 
 NON    YES    0.0119   -0.668 0.6918 False 
  RE    YES    0.0343  -0.6057 0.6742 False 
--------------------------------------------
Name 9 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.2414  -0.7792 0.2963 False 
 NON    YES    0.0696  -0.5746 0.7138 False 
  RE    YES    0.311   -0.3129 0.935  False 
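
The question asks specifically for YES vs NON, NON vs RE, and YES vs RE within each name. If statsmodels is not available, those pairwise comparisons can be sketched with scipy alone by running `f_oneway` on every pair of conditions. Note that this is a different technique from Tukey's HSD: a one-way ANOVA on two groups is equivalent to an equal-variance t-test, and the Bonferroni adjustment used here is a simple, conservative correction chosen for illustration. The helper name `pairwise_anova` and its output columns are assumptions, not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_anova(df, group_col="Names", cond_col="condition", value_col="value"):
    """For each name, run a one-way ANOVA on every pair of conditions.

    Hypothetical helper: returns one row per (name, condition pair).
    """
    columns = [group_col, "cond_a", "cond_b", "F", "p", "p_bonferroni"]
    rows = []
    for name, name_df in df.groupby(group_col):
        # Map each condition label to its array of values for this name
        samples = {cond: vals.to_numpy()
                   for cond, vals in name_df.groupby(cond_col)[value_col]}
        pairs = list(combinations(sorted(samples), 2))
        for a, b in pairs:
            f_val, p_val = stats.f_oneway(samples[a], samples[b])
            rows.append({
                group_col: name, "cond_a": a, "cond_b": b,
                "F": f_val, "p": p_val,
                # Bonferroni: scale by the number of pairs tested for this name, cap at 1
                "p_bonferroni": float(np.minimum(1.0, p_val * len(pairs))),
            })
    return pd.DataFrame(rows, columns=columns)


# Sample data shaped like the DataFrame in the answer
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Names": rng.integers(1, 10, size=1000),
    "value": rng.standard_normal(1000),
    "condition": rng.choice(["NON", "YES", "RE"], size=1000),
})

pairs_result = pairwise_anova(df)
print(pairs_result.head(6))
```

Names that have only one condition simply contribute no rows, so the function also runs on frames like the small one in the question.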
