What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?

Question

I have switched from R to pandas. I routinely get a SettingWithCopyWarning when I do something like this:

import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, which may or may not return a view
df_b = df_a[df_a['col1'] > 1]

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# SettingWithCopyWarning!!

I think I understand the problem, though I'll gladly learn what I got wrong. In the given example, it is undefined whether df_b is a view on df_a or not. Thus, the effect of assigning to df_b is unclear: does it affect df_a? The problem can be solved by explicitly making a copy when filtering:

df_a = pd.DataFrame({'col1': [1,2,3,4]})    

# Filtering step, definitely a copy now
df_b = df_a[df_a['col1'] > 1].copy()

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# No Warning now

I think there is something that I am missing: if we can never really be sure whether we create a view or not, what are views good for? From the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=view#indexing-view-versus-copy)

Outside of simple cases, it’s very hard to predict whether it [getitem] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

Similar warnings can be found for different indexing methods.

I find it very cumbersome and error-prone to sprinkle .copy() calls throughout my code. Am I using the wrong style for manipulating my DataFrames? Or is the performance gain so high that it justifies the apparent awkwardness?

Recommended answer

Good question!

The short answer is: this is a flaw in pandas that's being remedied.

You can find a longer discussion of the nature of the problem here, but the main take-away is that we're now moving to a "copy-on-write" behavior in which any time you slice, you get a new copy, and you never have to think about views. The fix will soon come through this refactoring project. I actually tried to fix it directly (see here), but it just wasn't feasible in the current architecture.

In truth, we'll keep views in the background -- they make pandas SUPER memory efficient and fast when they can be provided -- but we'll end up hiding them from users so, from the user perspective, if you slice, index, or cut a DataFrame, what you get back will effectively be a new copy.

(This is accomplished by creating views when the user is only reading data, but whenever an assignment operation is used, the view will be converted to a copy before the assignment takes place.)
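The mechanism in the parenthetical above can be illustrated with a toy sketch in plain Python. This is not pandas code: the `LazyView` class and its method names are hypothetical, and it only covers writes made through the child object (a fuller scheme would also have to protect the child when the parent is modified).

```python
class LazyView:
    """Toy copy-on-write wrapper around a shared list (hypothetical, not pandas internals)."""

    def __init__(self, data):
        self._data = data        # shared with the parent until the first write
        self._owns_data = False  # becomes True once a private copy exists

    def __getitem__(self, i):
        # Reads go straight to the shared buffer: no copy is made.
        return self._data[i]

    def __setitem__(self, i, value):
        # On the first write, detach from the parent by copying, then assign.
        if not self._owns_data:
            self._data = list(self._data)
            self._owns_data = True
        self._data[i] = value

    def shares_memory_with(self, other):
        # True while the wrapper still points at the very same list object.
        return self._data is other


parent = [1, 2, 3, 4]
view = LazyView(parent)

shared_before = view.shares_memory_with(parent)  # still a view: nothing copied yet
view[0] = 100                                    # triggers the copy, then assigns
shared_after = view.shares_memory_with(parent)   # now backed by a private copy
```

Reading through `view` costs no extra memory, while the write leaves `parent` untouched, which is the behavior the answer describes from the user's point of view.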

Best guess is that the fix will be in within a year. In the meantime, I'm afraid some .copy() calls may be necessary, sorry!
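As a rough illustration of the interim workaround, the sketch below reuses the question's `df_a` and shows the explicit `.copy()` next to one possible alternative, `DataFrame.assign`, which returns a new object instead of modifying a filtered result in place. The name `df_c` is introduced here only for the example, and whether `assign` reads better than `.copy()` is a matter of taste.

```python
import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Option 1: take an explicit copy after filtering, then add the column.
df_b = df_a[df_a['col1'] > 1].copy()
df_b['new_col'] = 2 * df_b['col1']

# Option 2: build the new column with assign(), which returns a new DataFrame
# rather than setting a column on the filtered result in place.
df_c = df_a[df_a['col1'] > 1].assign(new_col=lambda d: 2 * d['col1'])

# In both cases the original frame keeps only its original column.
print(list(df_a.columns))
print(df_b)
print(df_c)
```

Both variants leave `df_a` unchanged; the second one avoids a separate assignment statement, which can cut down on the number of `.copy()` calls in chained filtering code.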
