Add a field in a pandas DataFrame with MultiIndex columns
Problem description
I have looked for an answer to this question, as it seems pretty simple, but have not been able to find anything yet. Apologies if I missed something. I am using pandas version 0.10.0 and have been experimenting with data of the following form:
    import pandas
    import numpy as np
    import datetime

    start_date = datetime.datetime(2009,3,1,6,29,59)
    r = pandas.date_range(start_date, periods=12)
    cols_1 = ['AAPL', 'AAPL', 'GOOG', 'GOOG', 'GS', 'GS']
    cols_2 = ['close', 'rate', 'close', 'rate', 'close', 'rate']
    dat = np.random.randn(12, 6)
    cols = pandas.MultiIndex.from_arrays([cols_1, cols_2], names=['ticker','field'])
    dftst = pandas.DataFrame(dat, columns=cols, index=r)
    print dftst

    ticker                   AAPL                GOOG                  GS
    field                   close      rate     close      rate     close      rate
    2009-03-01 06:29:59  1.956255 -2.074371 -0.200568  0.759772 -0.951543  0.514577
    2009-03-02 06:29:59  0.069611 -2.684352 -0.310006  0.730205 -0.302949 -0.830452
    2009-03-03 06:29:59  2.077130 -0.903784  0.449857 -1.357464 -0.469572 -0.008757
    2009-03-04 06:29:59  1.585358 -2.063672  0.600889 -1.741606 -0.299875  0.565253
    2009-03-05 06:29:59  0.269123  0.226593  1.132663  0.485035  0.796858 -0.423112
    2009-03-06 06:29:59  0.094879 -1.040069  0.613450 -0.175266 -0.065172  3.374658
    2009-03-07 06:29:59 -1.255167 -0.326474  0.437053 -0.231594  0.437703 -0.256811
    2009-03-08 06:29:59  0.115454 -1.096841 -1.189211 -0.208098 -0.807860  0.158198
    2009-03-09 06:29:59  2.142816  0.173878 -0.160932  0.367309 -0.449765 -0.325400
    2009-03-10 06:29:59  0.470669 -0.346805  1.152648  0.844632  1.031602 -0.012502
    2009-03-11 06:29:59 -1.366954  0.452177  0.010713 -1.331553  0.226781  0.456900
    2009-03-12 06:29:59  2.182409  0.890023 -0.627318 -1.516574 -1.565416 -0.694320
As you can see, I am trying to represent 3d timeseries data. So I have a timeseries index and MultiIndex columns. I am pretty comfortable with slicing the data. If I wanted just a trailing mean of the close data, I can do the following:
    pandas.rolling_mean(dftst.ix[:,::2], 5)

    ticker                   AAPL      GOOG        GS
    field                   close     close     close
    2009-03-01 06:29:59       NaN       NaN       NaN
    2009-03-02 06:29:59       NaN       NaN       NaN
    2009-03-03 06:29:59       NaN       NaN       NaN
    2009-03-04 06:29:59       NaN       NaN       NaN
    2009-03-05 06:29:59  0.410966 -0.412356  0.722951
    2009-03-06 06:29:59 -0.103187 -0.497165  0.137731
    2009-03-07 06:29:59  0.000194 -0.645375 -0.298504
    2009-03-08 06:29:59 -0.074036 -0.541717 -0.035906
    2009-03-09 06:29:59 -0.391863 -0.671918 -0.554380
    2009-03-10 06:29:59 -0.336397 -0.411845 -0.992615
    2009-03-11 06:29:59 -0.251645 -0.289512 -0.458246
    2009-03-12 06:29:59 -0.138925  0.244572 -0.230743
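For readers on a current pandas release, where `pandas.rolling_mean` and `.ix` no longer exist, the same trailing mean can be sketched roughly as below. This is only an illustration under assumptions: the frame is rebuilt with fresh random data (so the numbers will not match the output above), `pd` is the usual alias for pandas, and the `close` columns are selected by label with `xs` rather than by position with `::2`.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like dftst (random data, so values differ from the post).
index = pd.date_range("2009-03-01 06:29:59", periods=12)
columns = pd.MultiIndex.from_product(
    [["AAPL", "GOOG", "GS"], ["close", "rate"]], names=["ticker", "field"]
)
dftst = pd.DataFrame(np.random.randn(12, 6), index=index, columns=columns)

# Pick every 'close' column by label; drop_level=False keeps both column levels.
close = dftst.xs("close", axis=1, level="field", drop_level=False)

# 5-period trailing mean; the first four rows have no full window and are NaN.
avg_close = close.rolling(5).mean()
print(avg_close)
```

Selecting by label avoids relying on `close` always sitting in the even-numbered column positions.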
What I cannot do is create a new field, like avg_close and assign to it. Ideally I would like to do something like the following:
dftst[:,'avg_close'] = pandas.rolling_mean(dftst.ix[:,::2], 5)
Even if I swap the levels of my MultiIndex, I cannot make it work:
    dftst = dftst.swaplevel(1,0,axis=1)
    print dftst['close']

    ticker                   AAPL      GOOG        GS
    2009-03-01 06:29:59  1.178557 -0.505672 -0.336645
    2009-03-02 06:29:59  0.234305  0.581429 -0.232252
    2009-03-03 06:29:59 -0.734798  0.117810  1.658418
    2009-03-04 06:29:59 -1.555033 -0.298322  0.127408
    2009-03-05 06:29:59  0.244102 -1.030041 -0.562039
    2009-03-06 06:29:59 -0.297454  1.150564 -1.930883
    2009-03-07 06:29:59  0.818910 -0.905296  1.219946
    2009-03-08 06:29:59  0.586816  0.965242  0.928546
    2009-03-09 06:29:59 -0.357693  0.071455  0.072956
    2009-03-10 06:29:59  0.651803 -0.685937  0.805779
    2009-03-11 06:29:59  0.569802 -0.062447 -1.349261
    2009-03-12 06:29:59 -1.886335  0.205778 -0.864273

    dftst['avg_close'] = pandas.rolling_mean(dftst['close'], 3)

    ----> 1 dftst['avg_close'] = pandas.rolling_mean(dftst['close'], 3)

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
       2041         else:
       2042             # set column
    -> 2043             self._set_item(key, value)
       2044
       2045     def _boolean_set(self, key, value):

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _set_item(self, key, value)
       2077         """
       2078         value = self._sanitize_column(key, value)
    -> 2079         NDFrame._set_item(self, key, value)
       2080
       2081     def insert(self, loc, column, value):

    /usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _set_item(self, key, value)
        544
        545     def _set_item(self, key, value):
    --> 546         self._data.set(key, value)
        547         self._clear_item_cache()
        548

    /usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in set(self, item, value)
        951         except KeyError:
        952             # insert at end
    --> 953             self.insert(len(self.items), item, value)
        954
        955         self._known_consolidated = False

    /usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in insert(self, loc, item, value)
        963
        964         # new block
    --> 965         self._add_new_block(item, value, loc=loc)
        966
        967         if len(self.blocks) > 100:

    /usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _add_new_block(self, item, value, loc)
        992             loc = self.items.get_loc(item)
        993         new_block = make_block(value, self.items[loc:loc+1].copy(),
    --> 994                                self.items)
        995         self.blocks.append(new_block)
        996

    /usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in make_block(values, items, ref_items)
        463         klass = ObjectBlock
        464
    --> 465     return klass(values, items, ref_items, ndim=values.ndim)
        466
        467 # TODO: flexible with index=None and/or items=None

    /usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in __init__(self, values, items, ref_items, ndim)
         30         if len(items) != len(values):
         31             raise AssertionError('Wrong number of items passed (%d vs %d)'
    ---> 32                                  % (len(items), len(values)))
         33
         34         self._ref_locs = None

    AssertionError: Wrong number of items passed (1 vs 3)
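The "(1 vs 3)" in the error appears to reflect the shape mismatch: the right-hand side has three columns (one per ticker), while the key `'avg_close'` names a single new column. One way around this on a current pandas release is to give the computed block its own MultiIndex labels and concatenate it, instead of assigning through a single key. This is a sketch under assumptions, not the 0.10.0 API: it uses `DataFrame.rolling`, `rename(..., level=...)` and `pd.concat`, and rebuilds the frame with random data in the original (ticker, field) level order.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like dftst (random data, so values differ from the post).
index = pd.date_range("2009-03-01 06:29:59", periods=12)
columns = pd.MultiIndex.from_product(
    [["AAPL", "GOOG", "GS"], ["close", "rate"]], names=["ticker", "field"]
)
dftst = pd.DataFrame(np.random.randn(12, 6), index=index, columns=columns)

# One 'close' column per ticker, still carrying both column levels.
close = dftst.xs("close", axis=1, level="field", drop_level=False)

# 3-period trailing mean, relabelled so each column becomes (ticker, 'avg_close').
avg = close.rolling(3).mean().rename(columns={"close": "avg_close"}, level="field")

# Append the new field for every ticker at once, then group columns by ticker.
result = pd.concat([dftst, avg], axis=1).sort_index(axis=1)
print(result.columns.tolist())
```

Because the new block already has one column per ticker, nothing needs to be broadcast; the concat simply lines the columns up by their (ticker, field) labels.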
If my columns were not MultiIndex, I could assign doing the following:
    start_date = datetime.datetime(2009,3,1,6,29,59)
    r = pandas.date_range(start_date, periods=12)
    cols = ['AAPL', 'GOOG', 'GS']
    dat = np.random.randn(12, 3)
    dftst2 = pandas.DataFrame(dat, columns=cols, index=r)
    print dftst2

                             AAPL      GOOG        GS
    2009-03-01 06:29:59  2.476787  2.386037 -0.777566
    2009-03-02 06:29:59 -0.820647  1.006159 -0.590240
    2009-03-03 06:29:59  0.433960  0.104458  0.282641
    2009-03-04 06:29:59  0.300190 -0.300786 -1.780412
    2009-03-05 06:29:59 -0.247919  1.616572  1.145594
    2009-03-06 06:29:59 -0.779130  0.695256  0.845819
    2009-03-07 06:29:59  0.572073  0.349394 -3.557776
    2009-03-08 06:29:59  2.019885  0.358346  1.350812
    2009-03-09 06:29:59  0.472328 -0.334223 -0.605862
    2009-03-10 06:29:59 -1.570479  0.410808  0.616515
    2009-03-11 06:29:59  1.177562 -0.240396 -2.126951
    2009-03-12 06:29:59  0.311566 -1.743213  0.382617
To add a field, based on another field, I can do the following:
    dftst2['GOOG_avg'] = pandas.rolling_mean(dftst2['GOOG'], 3)
    print dftst2

                             AAPL      GOOG        GS  GOOG_avg
    2009-03-01 06:29:59  2.476787  2.386037 -0.777566       NaN
    2009-03-02 06:29:59 -0.820647  1.006159 -0.590240       NaN
    2009-03-03 06:29:59  0.433960  0.104458  0.282641  1.165551
    2009-03-04 06:29:59  0.300190 -0.300786 -1.780412  0.269944
    2009-03-05 06:29:59 -0.247919  1.616572  1.145594  0.473415
    2009-03-06 06:29:59 -0.779130  0.695256  0.845819  0.670347
    2009-03-07 06:29:59  0.572073  0.349394 -3.557776  0.887074
    2009-03-08 06:29:59  2.019885  0.358346  1.350812  0.467666
    2009-03-09 06:29:59  0.472328 -0.334223 -0.605862  0.124506
    2009-03-10 06:29:59 -1.570479  0.410808  0.616515  0.144977
    2009-03-11 06:29:59  1.177562 -0.240396 -2.126951 -0.054604
    2009-03-12 06:29:59  0.311566 -1.743213  0.382617 -0.524267
I have tried using a Panel object, but so far have not found a quick way to add a field where I have MultiIndex columns, ideally the other level of the columns would be broadcast. I apologize if there have been other posts that answer this question. Any suggestions would be much appreciated.
Solution
You could also (as a workaround, since there isn't really an API that does exactly what you want) consider a bit of reshaping-fu if you don't want to use a Panel. I wouldn't recommend it on enormous data sets, though: use a Panel for that.
    In [30]: df = dftst.stack(0)

    In [31]: df['close_avg'] = pd.rolling_mean(df.close.unstack(), 5).stack()

    In [32]: df
    Out[32]:
    field                         close      rate  close_avg
                        ticker
    2009-03-01 06:29:59 AAPL  -0.223042  0.554996        NaN
                        GOOG   0.060127 -0.333992        NaN
                        GS     0.117626 -1.256790        NaN
    2009-03-02 06:29:59 AAPL  -0.513743 -0.402661        NaN
                        GOOG   0.059828 -0.125288        NaN
                        GS    -0.336196 -0.510595        NaN
    2009-03-03 06:29:59 AAPL   0.142202 -1.038470        NaN
                        GOOG  -1.099251 -0.892581        NaN
                        GS     1.698086  0.885023        NaN
    2009-03-04 06:29:59 AAPL  -1.125821  0.413005        NaN
                        GOOG   0.424290  1.106983        NaN
                        GS     0.047158  0.680714        NaN
    2009-03-05 06:29:59 AAPL   0.470050  1.845354  -0.250071
                        GOOG   0.132956 -0.488800  -0.084410
                        GS     0.129190  0.208077   0.331173
    2009-03-06 06:29:59 AAPL  -0.087360 -2.102512  -0.222934
                        GOOG   0.165100 -0.134886  -0.063415
                        GS     0.167720  0.082480   0.341192
    2009-03-07 06:29:59 AAPL  -0.768542 -0.176076  -0.273894
                        GOOG   0.417694  2.257074   0.008158
                        GS    -1.744730 -1.850185   0.059485
    2009-03-08 06:29:59 AAPL  -0.297363 -0.633828  -0.361807
                        GOOG  -1.096703 -0.572138   0.008667
                        GS     0.890016 -2.621563  -0.102129
    2009-03-09 06:29:59 AAPL   1.038579  0.053330   0.071073
                        GOOG  -0.614050  0.607944  -0.199001
                        GS    -0.882848  0.596801  -0.288130
    2009-03-10 06:29:59 AAPL  -0.255226  0.058178  -0.073982
                        GOOG   1.761861  1.841751   0.126780
                        GS    -0.549998 -1.551281  -0.423968
    2009-03-11 06:29:59 AAPL   0.413522  0.149089   0.026194
                        GOOG  -2.964163  1.825312  -0.499072
                        GS    -0.373303  1.137001  -0.532173
    2009-03-12 06:29:59 AAPL  -0.924776  1.238546  -0.005053
                        GOOG  -0.985956 -0.906590  -0.779802
                        GS    -0.320400  1.239681  -0.247307
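The same stack/unstack reshaping can be sketched for a current pandas release, where `pd.rolling_mean` is gone and, as far as I know, `Panel` has been removed as well. This is a rough adaptation under assumptions rather than the answer's exact code: the frame is rebuilt with random data, levels are referred to by name, and the last step (going back to the wide layout) is an addition that the original answer does not show. Some recent pandas 2.x releases may emit a `FutureWarning` from `stack`; the result should be the same either way, because the assignment aligns on the (date, ticker) index.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like dftst (random data, so values differ from the post).
index = pd.date_range("2009-03-01 06:29:59", periods=12)
columns = pd.MultiIndex.from_product(
    [["AAPL", "GOOG", "GS"], ["close", "rate"]], names=["ticker", "field"]
)
dftst = pd.DataFrame(np.random.randn(12, 6), index=index, columns=columns)

# Move 'ticker' into the row index: rows are (date, ticker), columns are fields.
df = dftst.stack(level="ticker")

# Unstack 'close' back to one column per ticker, take the trailing mean,
# then stack again so it aligns with df's (date, ticker) rows on assignment.
df["close_avg"] = df["close"].unstack().rolling(5).mean().stack()

# Optional: return to the original wide layout with (ticker, field) columns.
wide = df.unstack("ticker").swaplevel(axis=1).sort_index(axis=1)
print(wide.columns.tolist())
```

In the long layout each field is an ordinary column, which is why the plain `df["close_avg"] = ...` assignment works here when it did not on the MultiIndex-column frame.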