Best way to set a MultiIndex on a pandas DataFrame
Question
I have a DataFrame df with these columns:
Group
Year
Gender
Feature_1
Feature_2
Feature_3
...
I want to use a MultiIndex to stack the data later, and I tried it this way:
df.index = pd.MultiIndex.from_arrays([df['Group'], df['Year'], df['Gender']])
This instruction successfully creates a MultiIndex for my DataFrame, but is there a better way that also removes the original columns?
Recommended answer
Indexing in pandas is easier than this. You do not need to create your own instance of the MultiIndex class.
The pandas DataFrame has a method called .set_index(), which takes either a single column or a list of columns as its argument. Supplying a list of columns sets a MultiIndex for you. By default (drop=True), the columns used for the index are also removed from the DataFrame, which is what the question asks for.
Like this:
df.set_index(['Group', 'Year', 'Gender'], inplace=True)
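To make the difference from the question's approach concrete, here is a minimal sketch. The sample values are made up for illustration; only the column names come from the question. It builds the index both ways and shows that set_index() with a list produces the same MultiIndex while also moving the key columns out of the frame.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Year": [2020, 2021, 2020, 2021],
    "Gender": ["F", "M", "F", "M"],
    "Feature_1": [1.0, 2.0, 3.0, 4.0],
    "Feature_2": [10, 20, 30, 40],
})

# The approach from the question: the index is set, but the columns stay.
manual = df.copy()
manual.index = pd.MultiIndex.from_arrays(
    [manual["Group"], manual["Year"], manual["Gender"]]
)

# set_index with a list of columns: the key columns become index levels
# and (with the default drop=True) are removed from the columns.
indexed = df.copy()
indexed.set_index(["Group", "Year", "Gender"], inplace=True)

print(list(manual.columns))       # still includes Group, Year, Gender
print(list(indexed.columns))      # only the feature columns remain
print(list(indexed.index.names))  # the three index level names
```

If you would rather keep the key columns as well, set_index() accepts drop=False.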
Note the inplace=True, which I can highly recommend.
When you are dealing with huge dataframes that barely fit in memory, in-place operations can roughly halve your peak memory usage.
Consider this:
df2 = df1.set_index('column') # Don't do this
del df1 # Don't do this
When this operation is done, the memory usage will be about the same as before, but only because we do del df1. In the time between these two commands, there will be two copies of the same dataframe and therefore double the memory usage.
This does exactly the same:
df1 = df1.set_index('column') # Don't do this either
And it will still take twice the memory of doing this in place.
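If you want to check this on your own data, a rough sketch along these lines may help. The frame and its values are hypothetical. It only shows that the reassignment form and the in-place form give the same result, and uses DataFrame.memory_usage() to report the footprint of one frame, which is roughly what a temporary second copy would cost. Whether a full copy is actually made depends on your pandas version, so treat the number as an estimate and measure rather than assume.

```python
import pandas as pd

# Hypothetical frame; 'column' matches the name used in the snippets above.
df1 = pd.DataFrame({
    "column": ["a", "b", "c", "d"],
    "value": [1, 2, 3, 4],
})

# Footprint of a single frame in bytes (deep=True also counts string data).
footprint = int(df1.memory_usage(deep=True).sum())

# Reassignment form: set_index returns a new frame.
reassigned = df1.set_index("column")

# In-place form: the frame itself is modified and nothing is returned.
in_place = df1.copy()
in_place.set_index("column", inplace=True)

# Both forms end up with the same data and the same index.
same = reassigned.equals(in_place)
print(footprint, same)
```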