Best way to set a multiindex on a pandas dataframe


Question

I have a DataFrame df with these columns:

Group
Year
Gender
Feature_1
Feature_2
Feature_3
...

I want to use a MultiIndex so that I can stack the data later, and I tried it this way:

df.index = pd.MultiIndex.from_arrays([df['Group'], df['Year'], df['Gender']])

This statement successfully creates a MultiIndex for my DataFrame, but is there a better way that also removes the original columns?
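To make the problem concrete, here is a minimal sketch of the attempt above. The sample values and the reduced set of feature columns are invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Year": [2019, 2020, 2019, 2020],
    "Gender": ["F", "M", "F", "M"],
    "Feature_1": [1.0, 2.0, 3.0, 4.0],
    "Feature_2": [10.0, 20.0, 30.0, 40.0],
})

# The approach from the question: build the MultiIndex by hand.
df.index = pd.MultiIndex.from_arrays([df["Group"], df["Year"], df["Gender"]])

# The index levels take their names from the Series they were built from...
print(list(df.index.names))
# ...but the source columns are still present as ordinary columns.
print(list(df.columns))

# Removing them takes a second, separate step.
df_dropped = df.drop(columns=["Group", "Year", "Gender"])
print(list(df_dropped.columns))
```

The data ends up stored twice, once in the index and once in the columns, until the columns are dropped explicitly.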

Recommended answer

Indexing in pandas is easier than this. You do not need to create your own instance of the MultiIndex class.

The pandas DataFrame has a method called .set_index(), which takes either a single column or a list of columns as its argument. Supplying a list of columns sets a MultiIndex for you, and by default the columns used are removed from the DataFrame.

Like this:

df.set_index(['Group', 'Year', 'Gender'], inplace=True)
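As a self-contained sketch of that call, the example below uses a small invented DataFrame (only the column names come from the question) and then stacks the result, which is what the question ultimately wants to do.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Year": [2019, 2020, 2019, 2020],
    "Gender": ["F", "M", "F", "M"],
    "Feature_1": [1.0, 2.0, 3.0, 4.0],
    "Feature_2": [10.0, 20.0, 30.0, 40.0],
})

# Passing a list of column names builds the MultiIndex and, because
# drop=True is the default, removes those columns from the frame.
df.set_index(["Group", "Year", "Gender"], inplace=True)

print(type(df.index).__name__)
print(list(df.index.names))
print(list(df.columns))

# Stacking moves the remaining feature columns into a fourth index level.
stacked = df.stack()
print(stacked.index.nlevels, len(stacked))
```

Only the feature columns remain after the call, so no separate drop step is needed.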

Note the inplace=True, which I highly recommend.

When you are dealing with huge DataFrames that barely fit in memory, in-place operations can roughly halve your peak memory usage.

Consider this:

df2 = df1.set_index('column') # Don't do this
del df1 # Don't do this

When this operation is done, memory usage will be about the same as before, but only because we ran del df1. Between the two commands there are two copies of the same DataFrame in memory, so peak usage is doubled.

This does exactly the same thing:

df1 = df1.set_index('column') # Don't do this either

And it will still take twice the memory of doing this in place.
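The two spellings differ only in how the result is obtained, not in what it contains. The sketch below checks that on a tiny invented DataFrame (the column names "column" and "value" are placeholders). It does not measure memory, and the actual copying behaviour may depend on the pandas version and its copy-on-write settings.

```python
import pandas as pd

# Hypothetical data; "column" mirrors the placeholder name used above.
df1 = pd.DataFrame({"column": ["a", "b", "c"], "value": [1, 2, 3]})

# Variant 1: reassign the result of set_index to the same name.
reassigned = df1.copy()
reassigned = reassigned.set_index("column")

# Variant 2: modify the frame in place.
in_place = df1.copy()
in_place.set_index("column", inplace=True)

# Both variants produce the same indexed frame.
print(reassigned.equals(in_place))
```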
