Dataframe Slice does not remove Index Values

Question

I recently ran into this issue with a large dataframe and its associated MultiIndex. This simplified example demonstrates it.

import pandas as pd
import numpy as np

np.random.seed(1)
idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])

df = pd.DataFrame(data=np.random.randint(1, 100, 4), index=idx, columns=['P'])
print(df)

Which yields:

      P
A 5  38
  6  13
B 5  73
  6  10

Now a quick look at the index:

print(df.index)

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

If I slice this dataframe, I notice that the MultiIndex never condenses, even with a deep copy.

What is the best way to reduce the memory footprint of the index in a slice operation?

df_slice = df[df['P'] > 20]
print(df_slice)
print(df_slice.index)

      P
A 5  38
B 5  73

See how the dataframe has reduced, but the index has not.

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])

The same happens even with .copy(deep=True):

df_slice = df[df['P'] > 20].copy(deep=True)
print(df_slice.index)

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])

I would have expected the MultiIndex to have the 6 removed, as shown:

MultiIndex(levels=[[u'A', u'B'], [5]],
           labels=[[0, 1], [0, 0]])

This becomes a practical problem when the dataframe is large.

Recommended answer

I understand your concern, but I think you need to look at what pandas is doing at a lower level.

First, note that indexes are supposed to be immutable. You can read more in the documentation here -> http://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-metadata

When you create a dataframe object (let's call it df) and want to access its rows, all you are really doing is passing a boolean Series that pandas matches against the corresponding index.

Consider the following example:

index = pd.MultiIndex.from_product([['A','B'],[5,6]])
df = pd.DataFrame(data=np.random.randint(1,100,(4)), index=index, columns=["P"])

      P
A 5   5
  6  51
B 5  93
  6  76

Now, let's say we want to select the rows with P > 90. How would you do that? df[df["P"] > 90], right? But look at what df["P"] > 90 actually returns.

A  5    False
   6    False
B  5     True
   6    False
Name: P, dtype: bool

As you can see, it returns a boolean Series that shares the original index. Why? Because pandas needs to know which index values map to True so that it can select the right rows. So during your slice operations you always carry this index along, because it is the mapping element for the object.
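As a rough illustration of that point, the sketch below uses fixed values in place of the random ones (the values and variable names are assumptions made for the example). It checks that the mask shares the frame's index, then compares the levels the sliced index still stores with the labels it actually uses.

```python
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible.
idx = pd.MultiIndex.from_product([["A", "B"], [5, 6]])
df = pd.DataFrame({"P": [5, 51, 93, 76]}, index=idx)

# The boolean mask carries the same MultiIndex as the frame.
mask = df["P"] > 90
mask_shares_index = mask.index.equals(df.index)

# After slicing, the index still stores every original level value...
df_slice = df[mask]
stored_levels = [level.tolist() for level in df_slice.index.levels]

# ...even though only a subset of those values is actually in use.
used_values = [
    df_slice.index.get_level_values(i).unique().tolist()
    for i in range(df_slice.index.nlevels)
]

print(mask_shares_index)
print(stored_levels)
print(used_values)
```

The gap between the stored levels and the used values is exactly what the question is asking about.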

However, all is not lost. Depending on your application, if you believe the index really is taking up a large share of your memory, you can spend a little time doing the following:

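def df_sliced_index(df):
    # Rebuild the frame row by row so that the new MultiIndex only
    # contains the labels still present after slicing.
    new_index = []
    rows = []
    for ind, row in df.iterrows():
        new_index.append(ind)
        rows.append(row)
    return pd.DataFrame(data=rows, index=pd.MultiIndex.from_tuples(new_index))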

df_sliced_index(df[df['P'] > 90]).index

Which yields what I believe is the desired output:

MultiIndex(levels=[[u'B'], [5]], labels=[[0], [0]])
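The row-by-row loop can be slow on large frames, and iterrows may change column dtypes when the columns have mixed types. A possible variant, sketched here under the assumption that only the index needs rebuilding, creates the new MultiIndex directly from the tuples of the sliced index. The helper name rebuild_index and the sample values are hypothetical and not part of the original answer.

```python
import pandas as pd

def rebuild_index(frame):
    """Return a copy of `frame` whose MultiIndex keeps only the labels in use.

    Hypothetical helper for illustration. It assumes `frame` has a
    MultiIndex and at least one row (from_tuples cannot infer the number
    of levels from an empty list).
    """
    compact = pd.MultiIndex.from_tuples(
        frame.index.tolist(), names=frame.index.names
    )
    out = frame.copy()
    out.index = compact
    return out

# Example usage with fixed values in place of the random ones.
idx = pd.MultiIndex.from_product([["A", "B"], [5, 6]])
df = pd.DataFrame({"P": [5, 51, 93, 76]}, index=idx)

compact_slice = rebuild_index(df[df["P"] > 90])
compact_levels = [level.tolist() for level in compact_slice.index.levels]
print(compact_levels)
```

Because the data columns are copied as they are, their dtypes are left untouched and only the index is reconstructed.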

But if the data is large enough that the size of the index worries you, I wonder how much this will cost you in terms of time.
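On recent pandas versions there is also a built-in method, MultiIndex.remove_unused_levels(), which was not mentioned in the answer above and may avoid the manual rebuild entirely. The sketch below is a minimal example assuming such a pandas version is installed; it hard-codes the values from the question's seeded example for reproducibility.

```python
import pandas as pd

# Same values as the seeded example in the question, written out explicitly.
idx = pd.MultiIndex.from_product([["A", "B"], [5, 6]])
df = pd.DataFrame({"P": [38, 13, 73, 10]}, index=idx)

df_slice = df[df["P"] > 20].copy()
levels_before = [level.tolist() for level in df_slice.index.levels]

# remove_unused_levels() returns a new MultiIndex; assign it back to use it.
df_slice.index = df_slice.index.remove_unused_levels()
levels_after = [level.tolist() for level in df_slice.index.levels]

print(levels_before)  # [['A', 'B'], [5, 6]]
print(levels_after)   # [['A', 'B'], [5]]
```

The displayed rows do not change; only the stored levels shrink, which matches the output the question expected.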
