pandas, dataframe, groupby, std

Problem description

New to pandas here. I have a (trivial) problem with data on hosts, operations, and execution times. I want to group by host and then by host+operation, calculating the standard deviation of execution time first per host and then per host+operation pair. Seems simple?

It works for grouping by a single column:

df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial    132564  non-null values
host           132564  non-null values
idnum          132564  non-null values
operation      132564  non-null values
time           132564  non-null values
...
dtypes: float32(1), int64(2), object(6)



byhost = df.groupby('host')


byhost.std()
Out[362]:
                 datespecial         idnum      time
host
ahost1.test  11946.961952  40367.033852  0.003699
host1.test   15484.975077  38206.578115  0.008800
host10.test           NaN  37644.137631  0.018001
...

Nice. Now:

byhostandop = df.groupby(['host', 'operation'])

byhostandop.std()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
    386         # todo, implement at cython level?
    387         if ddof == 1:
--> 388             return self._cython_agg_general('std')
    389         else:
    390             f = lambda x: x.std(ddof=ddof)

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
   1615
   1616     def _cython_agg_general(self, how, numeric_only=True):
-> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
   1618         return self._wrap_agged_blocks(new_blocks)
   1619

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
   1653                 values = com.ensure_float(values)
   1654
-> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
   1656
   1657             # see if we can cast the block back to the original dtype

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
    838                 if is_numeric:
    839                     result = lib.row_bool_subset(result,
--> 840                                                  (counts > 0).view(np.uint8))
    841                 else:
    842                     result = lib.row_bool_subset_object(result,

/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()

ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'

Huh?? Why do I get this exception?

More questions:

  • How do I calculate the standard deviation on dataframe.groupby([several columns])?

  • How can I limit the calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.

Solution

It's important to know your version of pandas and Python. It looks like this exception can arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float32 columns to float64:

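# Assuming 'time' is the float32 column. Calling df.astype('float64') on the
# whole frame would fail on the string (object) columns such as 'host'.
df['time'] = df['time'].astype('float64')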
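If you don't know in advance which columns are float32, here is a minimal sketch of casting only those and leaving everything else alone. It assumes a reasonably recent pandas (select_dtypes does not exist in the very old versions this question is about), and the small frame is a made-up stand-in for the one in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame: one float32 column,
# one int64 column, and two string columns.
df = pd.DataFrame({
    "host": ["host1.test", "host1.test", "host10.test"],
    "operation": ["read", "write", "read"],
    "idnum": [1, 2, 3],
    "time": np.array([0.1, 0.2, 0.3], dtype="float32"),
})

# Find the float32 columns and cast only those to float64.
float32_cols = df.select_dtypes(include=["float32"]).columns
df = df.astype({col: "float64" for col in float32_cols})

print(df.dtypes)
```

The string and integer columns keep their original dtypes, so nothing tries to convert host names to numbers.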

To calculate std() on selected columns, just select columns :)

>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
   a   b  c  g
0  0  10  a  1
1  1  11  b  1
2  2  12  c  1
3  3  13  d  2
4  4  14  e  2
5  5  15  f  2
6  6  16  g  3
7  7  17  h  3
8  8  18  i  3
9  9  19  j  3
>>> df.groupby('g')[['a', 'b']].std()
          a         b
g                    
1  1.000000  1.000000
2  1.000000  1.000000
3  1.290994  1.290994
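The same column selection works when grouping by several columns, which covers the host+operation case from the question. A small sketch, using invented data (the column names follow the question, the values are made up):

```python
import pandas as pd

# Made-up data with the question's column names.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 2.0, 5.0, 9.0],
})

# Standard deviation of 'time' per host.
by_host = df.groupby("host")["time"].std()

# Standard deviation of 'time' per (host, operation) pair.
# The result is a Series indexed by a two-level MultiIndex.
by_host_and_op = df.groupby(["host", "operation"])["time"].std()

print(by_host)
print(by_host_and_op)
```

Selecting ['time'] before calling std() keeps date or id columns out of the calculation entirely.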

Update

As far as I can tell, std() calls aggregate() on the groupby result, and that path has a subtle bug (see here - Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():

byhostandop['time'].apply(lambda x: x.std())
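For what it's worth, on a pandas version without the bug the apply() route and the built-in std() should give the same numbers, so the workaround costs nothing but speed. A quick sketch to check that on made-up data (the values are invented, only the column names come from the question):

```python
import pandas as pd

# Made-up data with the question's column names.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 2.0, 5.0, 9.0],
})

grouped = df.groupby(["host", "operation"])["time"]

# Workaround from the answer: call Series.std() on each group.
via_apply = grouped.apply(lambda x: x.std())

# Built-in groupby aggregation.
via_std = grouped.std()

# Largest absolute difference between the two results.
max_diff = (via_apply - via_std).abs().max()
print(max_diff)
```

Both use the sample standard deviation (ddof=1) by default, so they are comparing like with like.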
