Pandas, compute many means with bootstrap confidence intervals for plotting

Problem description

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:

ATG12 Norm     ATG5 Norm    ATG7 Norm    Cancer Stage    
5.55           4.99         8.99         IIA
4.87           5.77         8.88         IIA
5.98           7.88         8.34         IIC

The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:

df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()

But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/ It boils down to:

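import numpy as np
import scikits.bootstrap as bootstrap
# Series is one column of data; np.mean replaces scipy.mean, which newer SciPy releases no longer provide
CI = bootstrap.ci(data=Series, statfunction=np.mean)
# CI[0] and CI[1] are the lower and upper bounds of the confidence interval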

I tried to apply this method to each subset of data with a nested-loop script:

for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]: # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)

This produced the following error message:

AttributeError: 'tuple' object has no attribute 'columns'

Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.

Solution

The way you iterate over the groupby object is wrong! groupby() splits your data frame by the values in your groupby column(s), and iterating over the result yields one "tuple" per group: (name, dataforgroup), where name is the group's value and dataforgroup is the sub-frame holding that group's rows. Your loop variable i is therefore a tuple, and a tuple has no columns attribute. The correct recipe for iterating over groupby objects is to unpack the tuple:

for name, group in data.groupby('Cancer Stage'):
    print(name)
    for p in group.columns[0:3]:
        ...

Please read more about the groupby functionality in the pandas documentation, and go through the Python reference to understand what tuples are!
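To make the tuple point concrete, here is a minimal sketch that uses the three sample rows from the question. The variable names pairs and stage_means are only illustrative, not part of pandas.

```python
import pandas as pd

# The three sample rows shown in the question.
data = pd.DataFrame({
    'ATG12 Norm': [5.55, 4.87, 5.98],
    'ATG5 Norm': [4.99, 5.77, 7.88],
    'ATG7 Norm': [8.99, 8.88, 8.34],
    'Cancer Stage': ['IIA', 'IIA', 'IIC'],
})

# Materialise the iteration so each item can be inspected.
pairs = list(data.groupby('Cancer Stage'))
print(type(pairs[0]))  # each item is a tuple, not a DataFrame

# Unpacking gives the group label and the sub-frame for that label.
stage_means = {}
for name, group in pairs:
    stage_means[name] = group['ATG12 Norm'].mean()
    print(name, group.shape, stage_means[name])
```

Calling .columns on pairs[0] would fail exactly as in the question, while group.columns works because group is a DataFrame.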

Grouping a data frame and applying a function can essentially be done in one statement, using the apply functionality of pandas:

import numpy as np
import scikits.bootstrap as bootstrap

cols = data.columns[0:3]  # the three Norm columns
for col in cols:
    print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x.values, statfunction=np.mean)))

This does everything you need in one line per column, and produces a (nicely plottable) Series of confidence intervals indexed by cancer stage.

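EDIT: I toyed around with a data frame object I created myself:

import pandas as pd
import scikits.bootstrap as bootstrap

df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))

This yields two Series that look like the following (bootstrap resampling is random, so the exact bounds vary from run to run):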

B
a    [6.58333333333, 14.3333333333]
b                      [8.5, 16.25]

B
a    [21.5833333333, 29.3333333333]
b            [23.4166666667, 31.25]
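To get from interval pairs like these to error bars, the lower and upper bounds have to be turned into distances from each mean. The sketch below is one way to do that for column A of the toy frame. It makes one loud assumption: percentile_ci is a hypothetical helper implementing a plain percentile bootstrap with NumPy, swapped in so the example runs without scikits.bootstrap. It is not the method bootstrap.ci uses by default, so its bounds will differ somewhat from the ones above.

```python
import numpy as np
import pandas as pd


def percentile_ci(values, n_samples=2000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap CI for the mean.

    A stand-in for scikits.bootstrap.ci, not a reimplementation of it.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx picks one resample (with replacement) of the data.
    idx = rng.integers(0, len(values), size=(n_samples, len(values)))
    boot_means = values[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high


df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})

grouped = df.groupby('B')['A']
means = grouped.mean()

# One (low, high) pair per group, keyed by group name.
cis = {name: percentile_ci(series.values) for name, series in grouped}
low = pd.Series({name: ci[0] for name, ci in cis.items()})
high = pd.Series({name: ci[1] for name, ci in cis.items()})

# Row 0: distance from mean down to the lower bound; row 1: up to the upper bound.
yerr = np.vstack([(means - low).values, (high - means).values])
print(means)
print(yerr)
```

The (2, N) layout of yerr is the shape matplotlib accepts for asymmetric error bars, so something like ax.bar(means.index, means.values, yerr=yerr) should draw the bar graph the question asks for. Repeating the same steps per Norm column gives one set of bars per column.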
