pandas DataFrame view vs copy, how do I tell?

Question

What is the difference between

df.loc[:,('col_a','col_b')]

and

df.loc[:,['col_a','col_b']]

The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Thanks

Recommended answer

If your DataFrame has a simple column index, then there is no difference. For example,

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

In [10]: df.loc[:, ('A','B')]
Out[10]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))

#       foo    bar   
#         A  B   A  B
# baz C   7  9   9  9
#     D   7  5   5  4
# qux C   5  0   5  1
#     D   1  7   7  4
#     C   6  4   3  5

In [27]: df.loc[:, ('foo','B')]
Out[27]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

The KeyError is saying that the MultiIndex has to be lexsorted. If we do that, then we still get a different result:

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]: 
      foo   
        A  B
baz C   7  9
    D   7  5
qux C   5  0
    D   1  7
    C   6  4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
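The tuple-versus-list distinction can be reproduced with a small sketch. It assumes a recent pandas release, where `sortlevel` is no longer available and `sort_index(axis=1)` plays the same role. The frame uses fixed values rather than random ones, and the list selection is limited to `['foo']` because newer pandas versions may raise a KeyError for list labels that are absent from the first level.

```python
import numpy as np
import pandas as pd

# Two-level column index: ('foo', 'A'), ('foo', 'B'), ('bar', 'A'), ('bar', 'B')
columns = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"], list("ABAB")])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)

# Lexsort the columns (assumed modern replacement for df.sortlevel(axis=1))
df_sorted = df.sort_index(axis=1)

# A tuple is read as one key across both levels: first level 'foo', second level 'B'
single = df_sorted.loc[:, ("foo", "B")]

# A list is read as several keys on the first level only
block = df_sorted.loc[:, ["foo"]]

print(type(single).__name__, single.name)
print(type(block).__name__, list(block.columns))
```

The tuple yields a single Series, while the list yields a DataFrame holding every column under `foo`.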

I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect. However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.
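As a minimal sketch of that assignment form, the example below writes through a single `df.loc[...]` call. The frame and the condition are illustrative choices, not taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# One .loc call with both the row selector and the column label:
# the assignment is applied to df itself.
df.loc[df["A"] > 3, "B"] = -1

print(df)
```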

The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
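The pitfall can be illustrated with a case where the intermediate result is known to be a copy: a boolean-mask row selection. This is only a sketch. pandas normally emits a warning for this pattern (the warning class differs between versions), so the example silences warnings to keep the output clean.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
mask = df["A"] > 3

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas usually warns about chained assignment
    # Chained form: df.loc[mask] builds a temporary object, and the
    # assignment to ["B"] lands on that temporary, not on df.
    df.loc[mask]["B"] = -1

# df still holds its original values in column B
chained_result = df["B"].tolist()
print(chained_result)
```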

I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

  • If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
  • If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
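One way to probe the first rule of thumb is `np.shares_memory`, which reports whether two arrays overlap in memory. The sketch below assumes a single-dtype frame and prints the results instead of asserting them for the slice case, since that part is an implementation detail that can vary between pandas versions. Sharing memory is also not the same as writes propagating, for example when Copy-on-Write is enabled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
base = df.to_numpy()

row_slice = df.loc[1:2]    # sequential rows, expressible as a basic slice
row_list = df.loc[[0, 2]]  # arbitrary rows, not expressible as a basic slice

slice_shares = np.shares_memory(base, row_slice.to_numpy())
list_shares = np.shares_memory(base, row_list.to_numpy())

print("slice selection shares memory with df:", slice_shares)
print("list selection shares memory with df:", list_shares)
```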

However, there is an easy way to determine a posteriori whether x = df.loc[..] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
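That check can be wrapped in a small helper. The name `changes_propagate` is hypothetical, and the helper really does modify `x` (and `df`, if `x` turns out to be a view), so it is best tried on scratch data. With Copy-on-Write enabled in newer pandas versions, it would be expected to report a copy every time.

```python
import warnings

import numpy as np
import pandas as pd


def changes_propagate(df, x):
    """Hypothetical helper: bump the first cell of x, report whether df changed."""
    before = df.copy()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # pandas may warn when writing to a subset
        x.iloc[0, 0] = x.iloc[0, 0] + 1
    return not df.equals(before)


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# Arbitrary rows and columns: by the rules of thumb above, probably a copy
x = df.loc[[0, 2], ["A", "B"]]
result = changes_propagate(df, x)
print("x behaves like a view:", result)
```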
