Python Pandas - Issue concatenating multi-indexed DataFrames



I am trying to merge two MultiIndex'ed dataframes. My code is below. The issue, as you can see in the output, is that the "DATE" index is repeated, whereas I'd like all the values (OPEN_INT, PX_LAST) to be on the same date index... any ideas? I've tried both append and concat, but both give me similar results.

    if df.empty:
        df = bbg_historicaldata(t, f, startDate, endDate)
        print(df)
        datesArray = list(df.index)
        tArray = [t for i in range(len(datesArray))]
        arrays = [tArray, datesArray]
        tuples = list(zip(*arrays))
        index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])
        df = pd.DataFrame({f: df[f].values}, index=index)
    else:
        temp = bbg_historicaldata(t, f, startDate, endDate)
        print(temp)
        datesArray = list(temp.index)
        tArray = [t for i in range(len(datesArray))]
        arrays = [tArray, datesArray]
        tuples = list(zip(*arrays))
        index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])

        temp = pd.DataFrame({f: temp[f].values}, index=index)

        #df = df.append(temp, ignore_index=True)
        df = pd.concat([df, temp], axis=1).sortlevel()

Essentially, I want no NaNs!

                        PX_LAST   OPEN_INT  PX_LAST  OPEN_INT  PX_LAST  \
TICKER      DATE                                                         
EDH8 COMDTY 2017-02-01   98.365  1008044.0      NaN       NaN      NaN   
            2017-02-02   98.370  1009994.0      NaN       NaN      NaN   
            2017-02-03   98.360  1019181.0      NaN       NaN      NaN   
            2017-02-06   98.405  1023863.0      NaN       NaN      NaN   
            2017-02-07   98.410  1024609.0      NaN       NaN      NaN   
            2017-02-08   98.435  1046258.0      NaN       NaN      NaN   
            2017-02-09   98.395  1050291.0      NaN       NaN      NaN   
EDM8 COMDTY 2017-02-01      NaN        NaN   98.245  726739.0      NaN   
            2017-02-02      NaN        NaN   98.250  715081.0      NaN   
            2017-02-03      NaN        NaN   98.235  723936.0      NaN   
            2017-02-06      NaN        NaN   98.285  729324.0      NaN   
            2017-02-07      NaN        NaN   98.295  728673.0      NaN   
            2017-02-08      NaN        NaN   98.325  728520.0      NaN   
            2017-02-09      NaN        NaN   98.280  741840.0      NaN   
EDU8 COMDTY 2017-02-01      NaN        NaN      NaN       NaN   98.130   
            2017-02-02      NaN        NaN      NaN       NaN   98.135   
            2017-02-03      NaN        NaN      NaN       NaN   98.120   
            2017-02-06      NaN        NaN      NaN       NaN   98.180   
            2017-02-07      NaN        NaN      NaN       NaN   98.190   
            2017-02-08      NaN        NaN      NaN       NaN   98.225   
            2017-02-09      NaN        NaN      NaN       NaN   98.175  

EDIT: Using axis=0 gives the following. I'd like it to collapse the duplicated dates (i.e., each date index should have unique values, with no duplicated days and no NaNs).

                         OPEN_INT  PX_LAST
TICKER      DATE                          
EDH8 COMDTY 2017-02-01        NaN   98.365
            2017-02-01  1008044.0      NaN
            2017-02-02        NaN   98.370
            2017-02-02  1009994.0      NaN
            2017-02-03        NaN   98.360
            2017-02-03  1019181.0      NaN
            2017-02-06        NaN   98.405
            2017-02-06  1023863.0      NaN
            2017-02-07        NaN   98.410
            2017-02-07  1024609.0      NaN
            2017-02-08        NaN   98.435
            2017-02-08  1046258.0      NaN
            2017-02-09        NaN   98.395
            2017-02-09  1050291.0      NaN
EDM8 COMDTY 2017-02-01        NaN   98.245
            2017-02-01   726739.0      NaN
            2017-02-02        NaN   98.250
            2017-02-02   715081.0      NaN
            2017-02-03        NaN   98.235
            2017-02-03   723936.0      NaN
            2017-02-06        NaN   98.285
            2017-02-06   729324.0      NaN
            2017-02-07        NaN   98.295
            2017-02-07   728673.0      NaN
            2017-02-08        NaN   98.325
            2017-02-08   728520.0      NaN
            2017-02-09        NaN   98.280
            2017-02-09   741840.0      NaN

Here is the input data, printed. I've added print(df) and print(temp) to the code above. Each is a dataframe with DATE as its index. The TICKER index comes from the variable "f" in the loop "for f in fields:".

            PX_LAST
DATE               
2017-02-01   98.365
2017-02-02   98.370
2017-02-03   98.360
2017-02-06   98.405
2017-02-07   98.410
2017-02-08   98.435
2017-02-09   98.395
             OPEN_INT
DATE                 
2017-02-01  1008044.0
2017-02-02  1009994.0
2017-02-03  1019181.0
2017-02-06  1023863.0
2017-02-07  1024609.0
2017-02-08  1046258.0
2017-02-09  1050291.0
            PX_LAST
DATE               
2017-02-01   98.245
2017-02-02   98.250
2017-02-03   98.235
2017-02-06   98.285
2017-02-07   98.295
2017-02-08   98.325
2017-02-09   98.280
            OPEN_INT
DATE                
2017-02-01  726739.0
2017-02-02  715081.0
2017-02-03  723936.0
2017-02-06  729324.0
2017-02-07  728673.0
2017-02-08  728520.0
2017-02-09  741840.0
            PX_LAST
DATE               
2017-02-01   98.130
2017-02-02   98.135
2017-02-03   98.120
2017-02-06   98.180
2017-02-07   98.190
2017-02-08   98.225
2017-02-09   98.175
            OPEN_INT
DATE                
2017-02-01  584448.0
2017-02-02  574246.0
2017-02-03  581897.0
2017-02-06  585169.0
2017-02-07  590248.0
2017-02-08  598478.0
2017-02-09  595884.0

Solution

Your logic is a little hard to follow (it's hard to see, for example, why you sometimes get different columns back from your data call). AFAICT, though, what you really want is to do a join among all the frames with the same ticker (if you set the index to TICKER, DATE), or a merge if TICKER and DATE are columns, and then concatenate the results. Trying to do both in one step is what's causing the problem.
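To sketch that two-step route, here is a minimal example with made-up sample values standing in for the bbg_historicaldata() results (one ticker, two fields):

```python
import pandas as pd

# Stand-ins for two single-field pulls for the same ticker, both indexed by DATE.
dates = pd.to_datetime(["2017-02-01", "2017-02-02"])
px = pd.DataFrame({"PX_LAST": [98.365, 98.370]}, index=pd.Index(dates, name="DATE"))
oi = pd.DataFrame({"OPEN_INT": [1008044.0, 1009994.0]}, index=pd.Index(dates, name="DATE"))

# Step 1: join the frames that share a ticker (same DATE index, different columns).
per_ticker = px.join(oi)

# Step 2: add the ticker as an outer index level; with several tickers you would
# pass one entry per ticker and get the stacked multi-indexed result directly.
per_ticker = pd.concat({"EDH8 COMDTY": per_ticker}, names=["TICKER", "DATE"])
print(per_ticker)
```

With more tickers, step 2 concatenates all the per-ticker joins at once, and no NaN cells are ever introduced.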

Alternatively, we can just concat the whole thing and then pivot, which is what I'll do here because it's easier to show.

(As an aside, repeatedly concatenating within a loop can be a performance problem because a lot of data needs to be copied each time, and should generally be avoided -- build a collection of what you want to concatenate first, and then apply that.)
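The accumulate-then-concat pattern looks like this (trivial placeholder frames stand in for the real per-iteration data):

```python
import pandas as pd

pieces = []
for i in range(3):
    # In the real loop this would be the per-(ticker, field) frame.
    pieces.append(pd.DataFrame({"value": [i]}))

# One concat at the end instead of one per iteration: each row is copied once,
# rather than being recopied on every pass through the loop.
result = pd.concat(pieces, ignore_index=True)
```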


Assuming that each of your frames starts out looking like the following (the column may differ):

In [532]: df
Out[532]: 
            PX_LAST
DATE               
2017-02-01   98.365
2017-02-02   98.370
2017-02-03   98.360
2017-02-06   98.405
2017-02-07   98.410
2017-02-08   98.435
2017-02-09   98.395

then instead of what you're doing now I'd just add the ticker to the frame and reset the index:

In [549]: df = df.assign(TICKER=t).reset_index()   #TICKER variable = t
Out[549]: 
         DATE  PX_LAST       TICKER
0  2017-02-01   98.365  EDH8 COMDTY
1  2017-02-02   98.370  EDH8 COMDTY
2  2017-02-03   98.360  EDH8 COMDTY
3  2017-02-06   98.405  EDH8 COMDTY
4  2017-02-07   98.410  EDH8 COMDTY
5  2017-02-08   98.435  EDH8 COMDTY
6  2017-02-09   98.395  EDH8 COMDTY

To make the concatenation more memory-friendly, let's melt this:

In [579]: pd.melt(df, id_vars=["TICKER", "DATE"])
Out[579]: 
        TICKER        DATE variable   value
0  EDH8 COMDTY  2017-02-01  PX_LAST  98.365
1  EDH8 COMDTY  2017-02-02  PX_LAST  98.370
2  EDH8 COMDTY  2017-02-03  PX_LAST  98.360
3  EDH8 COMDTY  2017-02-06  PX_LAST  98.405
4  EDH8 COMDTY  2017-02-07  PX_LAST  98.410
5  EDH8 COMDTY  2017-02-08  PX_LAST  98.435
6  EDH8 COMDTY  2017-02-09  PX_LAST  98.395

and append this to a list dfs. Now the partial frames will combine nicely, because they all have the same columns, and we can pivot to get our desired output:

In [589]: pd.concat(dfs).pivot_table(index=["TICKER", "DATE"], columns="variable", values="value")
Out[589]: 
variable                 OPEN_INT  PX_LAST
TICKER      DATE                          
EDH8 COMDTY 2017-02-01  1008044.0   98.365
            2017-02-02  1009994.0   98.370
            2017-02-03  1019181.0   98.360
            2017-02-06  1023863.0   98.405
[...]

This avoids all those intermediate NaNs. Since the concatenation+pivot approach works even without melting, at first I skipped the melt; on second thought, though, keeping those intermediate NaNs is a bad idea even when it works, because the intermediate memory requirements could grow to be prohibitive.
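Putting the whole pipeline together on made-up sample data (the tickers, fields, and values below are illustrative stand-ins for the Bloomberg pulls):

```python
import pandas as pd

# Stand-ins for bbg_historicaldata() output: one single-column frame per
# (ticker, field) pair, indexed by DATE.
dates = pd.to_datetime(["2017-02-01", "2017-02-02"])
raw = {
    ("EDH8 COMDTY", "PX_LAST"):  [98.365, 98.370],
    ("EDH8 COMDTY", "OPEN_INT"): [1008044.0, 1009994.0],
    ("EDM8 COMDTY", "PX_LAST"):  [98.245, 98.250],
}

dfs = []
for (t, f), values in raw.items():
    df = pd.DataFrame({f: values}, index=pd.Index(dates, name="DATE"))
    # Add the ticker, flatten, and melt so every piece has identical columns.
    df = df.assign(TICKER=t).reset_index()
    dfs.append(pd.melt(df, id_vars=["TICKER", "DATE"]))

# Single concat, then pivot back to one column per field.
out = pd.concat(dfs).pivot_table(index=["TICKER", "DATE"],
                                 columns="variable", values="value")
print(out)
```

Each (TICKER, DATE) row now holds all of its fields; a field that was never pulled for a ticker (OPEN_INT for EDM8 here) is NaN only in the final frame, never duplicated across rows.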
