How to write/read a Pandas DataFrame with MultiIndex from/to an ASCII file?

Question

I want to be able to create a Pandas DataFrame with a MultiIndex for both the row index and the column index, and to read it from an ASCII text file. My data looks like:

import numpy as np
from pandas import DataFrame, MultiIndex

col_indx = MultiIndex.from_tuples([('A',  'B',  'C'), ('A',  'B',  'C2'), ('A',  'B',  'C3'), 
                                   ('A',  'B2', 'C'), ('A',  'B2', 'C2'), ('A',  'B2', 'C3'), 
                                   ('A',  'B3', 'C'), ('A',  'B3', 'C2'), ('A',  'B3', 'C3'), 
                                   ('A2', 'B',  'C'), ('A2', 'B',  'C2'), ('A2', 'B',  'C3'), 
                                   ('A2', 'B2', 'C'), ('A2', 'B2', 'C2'), ('A2', 'B2', 'C3'), 
                                   ('A2', 'B3', 'C'), ('A2', 'B3', 'C2'), ('A2', 'B3', 'C3')], 
                                   names=['one','two','three']) 
row_indx = MultiIndex.from_tuples([(0,  'North', 'M'), 
                                   (1,  'East',  'F'), 
                                   (2,  'West',  'M'), 
                                   (3,  'South', 'M'), 
                                   (4,  'South', 'F'), 
                                   (5,  'West',  'F'), 
                                   (6,  'North', 'M'), 
                                   (7,  'North', 'M'), 
                                   (8,  'East',  'F'), 
                                   (9,  'South', 'M')], 
                                   names=['n', 'location', 'sex'])
size=len(row_indx), len(col_indx)
data = np.random.randint(0,10, size)
df = DataFrame(data, index=row_indx, columns=col_indx)
print(df)

I've tried df.to_csv() and read_csv() but they don't keep the index.

I was thinking of maybe creating a new format using extra delimiters. For example, using a row of ---------------- to mark the end of the column indexes and a | to mark the end of a row index. So it would look like this:

one            | A   A   A   A   A   A   A   A   A  A2  A2  A2  A2  A2  A2  A2  A2  A2
two            | B   B   B  B2  B2  B2  B3  B3  B3   B   B   B  B2  B2  B2  B3  B3  B3
three          | C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3
--------------------------------------------------------------------------------------
n location sex :                                                                      
0 North    M   | 2   3   9   1   0   6   5   9   5   9   4   4   0   9   6   2   6   1
1 East     F   | 6   2   9   2   7   0   0   3   7   4   8   1   3   2   1   7   7   5
2 West     M   | 5   8   9   7   6   0   3   0   2   5   0   3   9   6   7   3   4   9
3 South    M   | 6   2   3   6   4   0   4   0   1   9   3   6   2   1   0   6   9   3
4 South    F   | 9   6   0   0   6   1   7   0   8   1   7   6   2   0   8   1   5   3
5 West     F   | 7   9   7   8   2   0   4   3   8   9   0   3   4   9   2   5   1   7
6 North    M   | 3   3   5   7   9   4   2   6   3   2   7   5   5   5   6   4   2   9
7 North    M   | 7   4   8   6   8   4   5   7   9   0   2   9   1   9   7   9   5   6
8 East     F   | 1   6   5   3   6   4   6   9   6   9   2   4   2   9   8   4   2   4
9 South    M   | 9   6   6   1   3   1   3   5   7   4   8   6   7   7   8   9   2   3
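
A hand-rolled reader for a layout like this could be sketched as follows. This is only an illustration under stated assumptions: the function name `parse_multiindex_text` is hypothetical, labels are assumed to contain no spaces, and the small sample is made up. It uses only the standard library and returns plain lists.

```python
def parse_multiindex_text(text):
    """Parse the '|' and '---' delimited layout into plain Python lists."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    # The all-dashes line separates the column-index block from the row block.
    sep = next(i for i, ln in enumerate(lines) if set(ln) == {"-"})

    col_names, col_levels = [], []
    for ln in lines[:sep]:
        name, _, values = ln.partition("|")
        col_names.append(name.strip())
        col_levels.append(values.split())
    # Transpose the levels so that there is one tuple per column.
    col_tuples = list(zip(*col_levels))

    # The first line after the dashes names the row-index levels.
    row_names = lines[sep + 1].rstrip(":| ").split()

    row_tuples, data = [], []
    for ln in lines[sep + 2:]:
        key, _, values = ln.partition("|")
        # Row labels are kept as strings here, including the numeric level.
        row_tuples.append(tuple(key.split()))
        data.append([int(v) for v in values.split()])
    return col_names, col_tuples, row_names, row_tuples, data


sample = """\
one   | A   A  A2
two   | B  B2   B
------------------
n sex :
0 M   | 2   3   9
1 F   | 6   2   9
"""
col_names, col_tuples, row_names, row_tuples, data = parse_multiindex_text(sample)
print(col_tuples)
print(row_tuples)
```

The resulting lists could then be handed to `MultiIndex.from_tuples(col_tuples, names=col_names)` and `DataFrame(data, index=..., columns=...)` to rebuild the frame.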

Does Pandas have a way to write/read DataFrames to/from ASCII files with MultiIndexes?

Recommended answer

I'm not sure which version of pandas you are using, but with 0.7.3 you can export your DataFrame to a TSV file and retain the indices by doing this:

df.to_csv('mydf.tsv', sep='\t')

You need to export to TSV rather than CSV because the column headers contain , characters. This should solve the first part of your question.

The second part is a bit trickier because, as far as I can tell, you need to know beforehand what you want your DataFrame to contain. In particular, you need to know:

  1. Which columns of your TSV represent the row MultiIndex
  2. That the rest of the columns should also be converted to a MultiIndex

To illustrate this, let's read the TSV file we saved above back into a new DataFrame:

In [1]: t_df = read_table('mydf.tsv', index_col=[0,1,2])
In [2]: all(t_df.index == df.index)
Out[2]: True

So we managed to read mydf.tsv into a DataFrame that has the same row index as the original df. But:

In [3]: all(t_df.columns == df.columns)
Out[3]: False
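
The False result is consistent with the header labels coming back as plain strings rather than tuples. A minimal sketch using only the standard library, assuming the labels were written in tuple-repr form such as `"('A', 'B', 'C')"` (the label value here is illustrative):

```python
from ast import literal_eval

original = ('A', 'B', 'C')   # a label from the original column MultiIndex
as_text = str(original)      # the same label once it has been written out as text

# A string never compares equal to the tuple it was built from.
print(as_text == original)

# literal_eval parses a Python literal without executing arbitrary code.
restored = literal_eval(as_text)
print(restored == original)
```

`literal_eval` only accepts Python literals (strings, numbers, tuples, lists and so on), which makes it a safer choice than `eval` for text read from a file.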

The reason is that pandas (as far as I can tell) has no way of parsing the header row correctly into a MultiIndex. As I mentioned above, if you know beforehand that your TSV file header represents a MultiIndex, then you can do the following to fix this:

In [4]: from ast import literal_eval
In [5]: t_df.columns = MultiIndex.from_tuples(t_df.columns.map(literal_eval).tolist(), 
                                              names=['one','two','three'])
In [6]: all(t_df.columns == df.columns)
Out[6]: True
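
For readers on a much newer pandas than the 0.7.3 discussed above: recent releases write one header row per column level, and `read_csv` accepts a list of row numbers for `header`, so the column MultiIndex can be rebuilt at read time. A minimal sketch, assuming a recent pandas release, with a smaller made-up frame and an in-memory buffer standing in for `mydf.tsv`:

```python
import io

import numpy as np
import pandas as pd

col_indx = pd.MultiIndex.from_tuples(
    [('A', 'B', 'C'), ('A', 'B', 'C2'), ('A2', 'B', 'C'), ('A2', 'B2', 'C')],
    names=['one', 'two', 'three'])
row_indx = pd.MultiIndex.from_tuples(
    [(0, 'North', 'M'), (1, 'East', 'F'), (2, 'West', 'M')],
    names=['n', 'location', 'sex'])
data = np.arange(12, dtype='int64').reshape(3, 4)
df = pd.DataFrame(data, index=row_indx, columns=col_indx)

# Write tab-separated text; a StringIO buffer is used instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, sep='\t')
buf.seek(0)

# header lists the three column-level rows; index_col lists the three row-level columns.
t_df = pd.read_csv(buf, sep='\t', header=[0, 1, 2], index_col=[0, 1, 2])

print(t_df.index.equals(df.index))
print(t_df.columns.equals(df.columns))
```

As in the answer above, you still need to know in advance how many header rows and index columns the file contains, since those counts are passed explicitly.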
