Pandas: Knowing when an operation affects the original dataframe

Question

I love pandas and have been using it for years and feel pretty confident I have a good handle on how to subset dataframes and deal with views vs copies appropriately (though I use a lot of assertions to be sure). I also know that there have been tons of questions about SettingWithCopyWarning, e.g. How to deal with SettingWithCopyWarning in Pandas? and some great recent guides on wrapping your head around when it happens, e.g. Understanding SettingWithCopyWarning in pandas.

But I also know specific things like the quote from this answer are no longer in the most recent docs (0.22.0) and that many things have been deprecated over the years (leading to some inappropriate old SO answers), and that things are continuing to change.

Recently after teaching pandas to complete newcomers with very basic general Python knowledge about things like avoiding chained-indexing (and using .iloc/.loc), I've still struggled to provide general rules of thumb to know when it's important to pay attention to the SettingWithCopyWarning (e.g. when it's safe to ignore it).

I've personally found that the specific pattern of subsetting a dataframe according to some rule (e.g. slicing or a boolean operation) and then modifying that subset, independent of the original dataframe, is a much more common operation than the docs suggest. In this situation we want to modify the copy, not the original, and the warning is confusing/scary to newcomers.

I know it's not trivial to know ahead of time when a view vs a copy is returned, e.g.
What rules does Pandas use to generate a view vs a copy?
Checking whether data frame is copy or view in Pandas

So instead I'm looking for the answer to a more general (beginner friendly) question: when does performing an operation on a subsetted dataframe affect the original dataframe from which it was created, and when are they independent?

I've created some cases below that I think seem reasonable, but I'm not sure if there's a "gotcha" I'm missing or if there's any easier way to think about or check this. I was hoping someone could confirm that my intuitions about the following use cases are correct as they pertain to my question above.

import pandas as pd
df1 = pd.DataFrame({'A':[2,4,6,8,10],'B':[1,3,5,7,9],'C':[10,20,30,40,50]})

1) Warning: No
Original changed: No

# df1 will be unaffected because we use .copy() method explicitly 
df2 = df1.copy()
#
# Reference: docs
df2.iloc[0,1] = 100

2) Warning: Yes (I don't really understand why)
Original changed: No

# df1 will be unaffected because .query() always returns a copy
#
# Reference:
# https://stackoverflow.com/a/23296545/8022335
df2 = df1.query('A < 10')
df2.iloc[0,1] = 100

3) Warning: Yes
Original changed: No

# df1 will be unaffected because boolean indexing with .loc
# always returns a copy
#
# Reference:
# https://stackoverflow.com/a/17961468/8022335
df2 = df1.loc[df1['A'] < 10,:]
df2.iloc[0,1] = 100

4) Warning: No
Original changed: No

# df1 will be unaffected because list indexing with .loc (or .iloc)
# always returns a copy
#
# Reference:
# Same as 3)
df2 = df1.loc[[0,3,4],:]
df2.iloc[0,1] = 100

5) Warning: No
Original changed: Yes (confusing to newcomers but makes sense)

# df1 will be affected because scalar/slice indexing with .iloc/.loc
# always references the original dataframe, but may sometimes 
# provide a view and sometimes provide a copy
#
# Reference: docs
df2 = df1.loc[:10,:]
df2.iloc[0,1] = 100

tl;dr When creating a new dataframe from the original, changing the new dataframe:
Will change the original when scalar/slice indexing with .loc/.iloc is used to create the new dataframe.
Will not change the original when boolean indexing with .loc, .query(), or .copy() is used to create the new dataframe
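
Since the view/copy rules have shifted between pandas releases, one way to test these intuitions is to check them empirically on the installed version. The following is a minimal sketch, not part of the original cases: `modifies_original` and `make_subset` are hypothetical names introduced here, and the helper simply compares the source dataframe against a snapshot taken before the write.

```python
import pandas as pd

def modifies_original(make_subset):
    """Return True if writing to the subset also changes the source frame.

    `make_subset` is a hypothetical callable: it takes a dataframe and
    returns the subset whose behaviour we want to test.
    """
    df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                        'B': [1, 3, 5, 7, 9],
                        'C': [10, 20, 30, 40, 50]})
    snapshot = df1.copy()      # independent reference used only for comparison
    df2 = make_subset(df1)
    df2.iloc[0, 1] = 100       # may emit SettingWithCopyWarning on some versions
    return not df1.equals(snapshot)

# One entry per case from the question.
results = {
    'copy':    modifies_original(lambda df: df.copy()),
    'query':   modifies_original(lambda df: df.query('A < 10')),
    'boolean': modifies_original(lambda df: df.loc[df['A'] < 10, :]),
    'list':    modifies_original(lambda df: df.loc[[0, 3, 4], :]),
    'slice':   modifies_original(lambda df: df.loc[:10, :]),
}
print(results)
```

The first four entries should come out False, matching cases 1 to 4. The 'slice' entry is the one that depends on the pandas version in use, so no expected value is given for it here.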

Recommended answer

This is a somewhat confusing and even frustrating part of pandas, but for the most part you shouldn't really have to worry about this if you follow some simple workflow rules. In particular, note that there are only two general cases here when you have two dataframes, with one being a subset of the other.

This is a case where the Zen of Python rule "explicit is better than implicit" is a great guideline to follow.

The first case, where changes to df2 should not affect df1, is trivial. You want two completely independent dataframes, so you just explicitly make a copy:

df2 = df1.copy()

此后,您对 df2 所做的任何事情只会影响 df2 而不会影响 df1,反之亦然.

After this anything you do to df2 affects only df2 and not df1 and vice versa.
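
As a quick illustration of that independence, here is a small sketch using the dataframe from the question. The `snapshot` variable is an assumption added only so the original can be compared before and after the change.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})

snapshot = df1.copy()      # reference copy, used only for the comparison below
df2 = df1.copy()           # explicit, independent copy
df2.iloc[0, 1] = 100       # modify the copy only

original_unchanged = df1.equals(snapshot)   # df1 still matches the snapshot
copy_changed = df2.iloc[0, 1] == 100        # the edit landed in df2
print(original_unchanged, copy_changed)
```

Because the copy is explicit, no SettingWithCopyWarning is expected here, and there is nothing to guess about whether df2 is a view.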

In the second case, where changes to df2 should also be reflected in df1, I don't think there is one general way to solve the problem because it depends on exactly what you're trying to do. However, there are a couple of standard approaches that are pretty straightforward and leave no ambiguity about how they work.

Approach 1: Copy df1 to df2, then use df2 to update df1

In this case, you can basically do a one to one conversion of the examples above. Here's example #2:

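df2 = df1.query('A < 10').copy()   # explicit copy of the subset, so no SettingWithCopyWarning
df2.iloc[0,1] = 100

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
# Rows from df2 come first, so drop_duplicates keeps the modified versions.
df1 = pd.concat([df2, df1]).reset_index().drop_duplicates(subset='index').drop(columns='index')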

Unfortunately the re-merging step there is a bit verbose. You can do it more cleanly with the following, although it has the side effect of converting integers to floats.

df1.update(df2)   # note that this is an inplace operation
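
Put together, the update-based version of example #2 might look like the sketch below. The `original_dtypes` variable and the final `astype` call are additions for illustration: they cast the columns back in case the integer dtype matters, since whether update() actually changes the dtype can depend on the pandas version.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})
original_dtypes = df1.dtypes           # remembered so the dtypes can be restored

df2 = df1.query('A < 10').copy()       # explicit copy of the subset
df2.iloc[0, 1] = 100                   # change the subset only

df1.update(df2)                        # in place: aligns on index and column labels
df1 = df1.astype(original_dtypes.to_dict())   # undo any int -> float conversion

print(df1)
```

Rows of df1 that are not present in df2 (here the row with A == 10) are left as they were, because update() only overwrites positions where df2 has a non-missing value.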

Approach 2: Use a mask (don't create df2 at all)

I think the best general approach here is not to create df2 at all, but rather have it be a masked version of df1. Somewhat unfortunately, the code above can't be translated directly because it mixes loc and iloc, which is fine for this example but probably unrealistic for actual use.

The advantage is that you can write very simple and readable code. Here's an alternative version of example #2 above where df2 is actually just a masked version of df1. But instead of changing via iloc, I'll change if column "C" == 10.

df2_mask = df1['A'] < 10
df1.loc[ df2_mask & (df1['C'] == 10), 'B'] = 100

Now if you print df1 or df1[df2_mask] you will see that column "B" = 100 for the first row of each dataframe. Obviously this is not very surprising here, but that's the inherent advantage of following "explicit is better than implicit".
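
If the positional update from example #2 ("first row of the subset") is really what is needed, one possible translation is to turn the position into an index label first and then make a single .loc assignment on df1. This is only a sketch under that assumption; `first_label` is a name introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})

df2_mask = df1['A'] < 10

# Label of the first row selected by the mask (position 0 within the subset).
first_label = df1.index[df2_mask.to_numpy()][0]

# A single .loc call on df1 itself: no intermediate dataframe, no chained indexing.
df1.loc[first_label, 'B'] = 100

print(df1[df2_mask])
```

Since the assignment targets df1 directly, there is no question of whether a view or a copy was modified.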
