pandas: applying a column MultiIndex to a DataFrame


Problem description

I have a few files with time-series data for various stocks, each with several fields. Each file contains the columns

time, open, high, low, close, volume

The goal is to get all of that into one DataFrame of the form

field      open                              high                            ...
security    hk_1      hk_2      hk_3 ...      hk_1      hk_2      hk_3 ...  ...
time
t_1      open_1_1  open_2_1  open_3_1 ...  high_1_1  high_2_1  high_3_1 ...  ...            
t_2      open_1_2  open_2_2  open_3_2 ...  high_1_2  high_2_2  high_3_2 ...  ...
...        ...        ...       ... ...       ...       ...       ... ...  ...

I created a MultiIndex

fields = ['time','open','high','low','close','volume','numEvents','value']
midx = pd.MultiIndex.from_product([['security_name'], fields], names=['security', 'field'])

and, as a start, tried to apply that MultiIndex to the DataFrame I get from reading the CSV data (by creating a new DataFrame and passing the index to it)

for c in eqty_names_list:

    midx = pd.MultiIndex.from_product([[c], fields], names=['security', 'field'])

    df_temp = pd.read_csv('{}{}.csv'.format(path, c))
    df_temp = pd.DataFrame(df_temp, columns=midx, index=df_temp['time'])
    df_temp.df_name = c
    all_dfs.append(df_temp)

However, the new DataFrame contains only NaN

security    1_HK
field       time    open    high    low     close   volume
time                                
 NaN         NaN     NaN     NaN    NaN       NaN      NaN

It also still contains a column for time, although I tried to make time the index (so that I can later join the DataFrames for the other stocks on the index to get the aggregated DataFrame).

How can I apply the MultiIndex to the DataFrame without losing my data, and then join the DataFrames that look like this

security    1_HK
field       time    open    high    low     close   volume
time

to create something like this (note that the hierarchy levels field and security are switched)

field       time                open    high        ...
security    1_HK    2_HK ...    1_HK    2_HK ...    ...
time


Recommended answer

I think you can first collect all the file names in a list, then use a list comprehension to read each one into a DataFrame, and concat them by columns (axis=1). If you also pass the keys parameter, you get a MultiIndex in the columns:

Files:

a.csv
b.csv
c.csv

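import glob

import pandas as pd

# glob returns paths in arbitrary order, so sort them to keep the
# files aligned with the names in eqty_names_list below
files = sorted(glob.glob('files/*.csv'))
dfs = [pd.read_csv(fp) for fp in files]

# one key per file; the keys become the outer level of the columns
eqty_names_list = ['hk1', 'hk2', 'hk3']
df = pd.concat(dfs, keys=eqty_names_list, axis=1)

print(df)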
  hk1       hk2       hk3      
    a  b  c   a  b  c   a  b  c
0   0  1  2   0  9  6   0  7  1
1   1  5  8   1  6  4   1  3  2
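
The concat step above can be tried without any files on disk. The following is a minimal sketch, assuming two securities with made-up prices; the `csv_text` dictionary and its contents are hypothetical stand-ins for the real CSV files. It also passes `index_col="time"` to `read_csv`, so that time becomes the row index the question asked for, and uses `names` to label the two column levels.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the per-security CSV files; real code would
# pass file paths to pd.read_csv instead of StringIO objects.
csv_text = {
    "hk_1": "time,open,high\nt_1,1.0,2.0\nt_2,1.5,2.5\n",
    "hk_2": "time,open,high\nt_1,3.0,4.0\nt_2,3.5,4.5\n",
}

# index_col="time" makes the time column the row index of each frame.
dfs = {
    name: pd.read_csv(io.StringIO(text), index_col="time")
    for name, text in csv_text.items()
}

# Passing a dict to concat uses its keys as the outer column level;
# names labels the outer (security) and inner (field) levels.
df = pd.concat(dfs, axis=1, names=["security", "field"])
print(df)
```

Because the frames are aligned on the time index, a timestamp missing from one file would show up as NaN for that security rather than shifting its rows.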

Finally, use swaplevel and sort_index to move the field level to the top and group the columns by field:

df.columns = df.columns.swaplevel(0,1)
df = df.sort_index(axis=1)
print (df)
    a           b           c        
  hk1 hk2 hk3 hk1 hk2 hk3 hk1 hk2 hk3
0   0   0   0   1   9   7   2   6   1
1   1   1   1   5   6   3   8   4   2
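
As for the all-NaN frame in the question: passing an existing DataFrame to the `pd.DataFrame` constructor together with `columns=` selects columns by label rather than renaming them, and none of the new tuple labels exist in the original frame. A minimal sketch of the difference, assuming a small hypothetical frame `raw` in place of one parsed CSV file:

```python
import pandas as pd

# Hypothetical single-security frame standing in for one parsed CSV file.
raw = pd.DataFrame(
    {"time": ["t_1", "t_2"], "open": [1.0, 1.5], "high": [2.0, 2.5]}
)
midx = pd.MultiIndex.from_product(
    [["1_HK"], ["open", "high"]], names=["security", "field"]
)

# Constructor with columns=: looks up the labels ('1_HK', 'open') and
# ('1_HK', 'high') in raw, finds neither, and fills the result with NaN.
broken = pd.DataFrame(raw, columns=midx)

# Relabeling instead: move time into the index, then assign the new labels.
fixed = raw.set_index("time")
fixed.columns = midx

# Swap the levels so that field is the outer level, as in the target layout.
fixed = fixed.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(fixed)
```

Assigning to `columns` only works when the new labels are listed in the same order as the existing columns, so the field list has to match the file's column order once time has been moved to the index.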
