ANOVA for groups within a dataframe using scipy
Problem description
I have a dataframe as shown below, and I need to run an ANOVA on it between three conditions. The dataframe looks like:
import pandas as pd

data0 = pd.DataFrame({'Names': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007', 'AC007'],
                      'value': [22, 22, 2, 2, 2, 5],
                      'condition': ['NON', 'NON', 'YES', 'YES', 'RE', 'RE']})
Within each name, I need to run an ANOVA between each pair of conditions: YES vs. NON, NON vs. RE, and YES vs. RE. I know I could do it like this for a single name and pair:
# Select the values for one name and one condition at a time
NON = data0.query('condition == "NON" and Names == "CTA15"')
no = NON.value
YES = data0.query('condition == "YES" and Names == "CTA15"')
Y = YES.value
Then perform a one-way ANOVA as follows:
from scipy import stats

f_val, p_val = stats.f_oneway(no, Y)
print("One-way ANOVA P =", p_val)
But it would be great if there were a more elegant solution, as my actual dataframe is big and has many names and conditions to compare.
Consider the following sample DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Names': np.random.randint(1, 10, 1000),
                   'value': np.random.randn(1000),
                   'condition': np.random.choice(['NON', 'YES', 'RE'], 1000)})
df.head()
Out:
   Names condition     value
0      4        RE  0.844120
1      4       NON -0.440285
2      5       YES  0.559497
3      4        RE  0.472425
4      9       YES  0.205906
The following groups the DataFrame by Names, and then passes each condition group to ANOVA:
import scipy.stats as ss

# Each item yielded by groupby is a (key, sub-DataFrame) tuple
for name_group in df.groupby('Names'):
    # One Series of values per condition within this name
    samples = [condition[1] for condition in name_group[1].groupby('condition')['value']]
    f_val, p_val = ss.f_oneway(*samples)
    print('Name: {}, F value: {:.3f}, p value: {:.3f}'.format(name_group[0], f_val, p_val))
Name: 1, F value: 0.138, p value: 0.871
Name: 2, F value: 1.458, p value: 0.237
Name: 3, F value: 0.742, p value: 0.479
Name: 4, F value: 2.718, p value: 0.071
Name: 5, F value: 0.255, p value: 0.776
Name: 6, F value: 1.731, p value: 0.182
Name: 7, F value: 0.269, p value: 0.764
Name: 8, F value: 0.474, p value: 0.624
Name: 9, F value: 1.226, p value: 0.297
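The loop above runs a single ANOVA across all three conditions for each name. If the pair-by-pair comparisons from the question are wanted instead (YES vs. NON, NON vs. RE, YES vs. RE), one possible sketch is to iterate over condition pairs with itertools.combinations and collect the results in a DataFrame. The helper name pairwise_anova, the column names cond_a and cond_b, and the random seed are illustrative assumptions, not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data shaped like the sample DataFrame (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Names': rng.integers(1, 10, 1000),
    'value': rng.standard_normal(1000),
    'condition': rng.choice(['NON', 'YES', 'RE'], 1000),
})


def pairwise_anova(frame):
    """Hypothetical helper: one-way ANOVA for every pair of conditions within each name."""
    rows = []
    for name, name_df in frame.groupby('Names'):
        # Map each condition label to the array of values observed for this name.
        samples = {cond: vals.to_numpy()
                   for cond, vals in name_df.groupby('condition')['value']}
        for a, b in combinations(sorted(samples), 2):
            f_val, p_val = stats.f_oneway(samples[a], samples[b])
            rows.append({'Names': name, 'cond_a': a, 'cond_b': b,
                         'F': f_val, 'p': p_val})
    return pd.DataFrame(rows)


result = pairwise_anova(df)
print(result.head())
```

Note that these pairwise p-values are not adjusted for multiple comparisons, which is what the post-hoc test is for.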
For post-hoc tests, you can use pairwise_tukeyhsd from statsmodels:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

for name, grouped_df in df.groupby('Names'):
    print('Name {}'.format(name), pairwise_tukeyhsd(grouped_df['value'], grouped_df['condition']))
Name 1 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE       0.0086 -0.5129 0.5301 False
NON    YES      0.0084 -0.4817 0.4986 False
RE     YES     -0.0002 -0.5217 0.5214 False
--------------------------------------------
Name 2 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE      -0.0089 -0.5299 0.5121 False
NON    YES       0.083 -0.4182 0.5842 False
RE     YES      0.0919 -0.4008 0.5846 False
--------------------------------------------
Name 3 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE       0.2401 -0.3136 0.7938 False
NON    YES      0.2765 -0.2903 0.8432 False
RE     YES      0.0364 -0.5052  0.578 False
--------------------------------------------
Name 4 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE       0.0894 -0.5825 0.7613 False
NON    YES     -0.0437 -0.7418 0.6544 False
RE     YES     -0.1331 -0.6949 0.4287 False
--------------------------------------------
Name 5 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE      -0.4264 -0.9495 0.0967 False
NON    YES      0.0439 -0.4264 0.5142 False
RE     YES      0.4703 -0.0155 0.9561 False
--------------------------------------------
Name 6 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE       0.0649 -0.4971  0.627 False
NON    YES      -0.406 -0.9405 0.1285 False
RE     YES     -0.4709 -1.0136 0.0717 False
--------------------------------------------
Name 7 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE       0.3111 -0.2766 0.8988 False
NON    YES     -0.1664 -0.7314 0.3987 False
RE     YES     -0.4774 -1.0688  0.114 False
--------------------------------------------
Name 8 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE      -0.0224  -0.668 0.6233 False
NON    YES      0.0119  -0.668 0.6918 False
RE     YES      0.0343 -0.6057 0.6742 False
--------------------------------------------
Name 9 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff   lower  upper reject
--------------------------------------------
NON    RE      -0.2414 -0.7792 0.2963 False
NON    YES      0.0696 -0.5746 0.7138 False
RE     YES       0.311 -0.3129  0.935 False
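If you would rather stay within SciPy for the post-hoc step, newer releases (1.8 and later, which is an assumption about your environment) provide scipy.stats.tukey_hsd. The sketch below applies it per name and gathers the pairwise results in a DataFrame instead of printing tables. The helper name tukey_by_name, the output column names, and the random seed are illustrative choices, and the sign convention of the reported mean difference may differ from the statsmodels output.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data shaped like the sample DataFrame (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Names': rng.integers(1, 10, 1000),
    'value': rng.standard_normal(1000),
    'condition': rng.choice(['NON', 'YES', 'RE'], 1000),
})


def tukey_by_name(frame, confidence_level=0.95):
    """Hypothetical helper: Tukey HSD across conditions, run separately per name."""
    rows = []
    for name, name_df in frame.groupby('Names'):
        grouped = name_df.groupby('condition')['value']
        labels = sorted(grouped.groups)
        # tukey_hsd takes one array per group, in the order given by labels.
        res = stats.tukey_hsd(*(grouped.get_group(c).to_numpy() for c in labels))
        ci = res.confidence_interval(confidence_level=confidence_level)
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                rows.append({
                    'Names': name,
                    'group1': labels[i],
                    'group2': labels[j],
                    'meandiff': res.statistic[i, j],  # difference in means as reported by SciPy
                    'lower': ci.low[i, j],
                    'upper': ci.high[i, j],
                    'p_adj': res.pvalue[i, j],
                })
    return pd.DataFrame(rows)


tukey = tukey_by_name(df)
print(tukey.head())
```

Having the results in a DataFrame makes it straightforward to filter for significant pairs, for example with tukey[tukey['p_adj'] < 0.05].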