What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?


Question

I have switched from R to pandas. I routinely get a SettingWithCopyWarning when I do something like:

import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, which may or may not return a view
df_b = df_a[df_a['col1'] > 1]

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# The assignment above emits a SettingWithCopyWarning

I think I understand the problem, though I'll gladly learn what I got wrong. In the given example, it is undefined whether df_b is a view on df_a or not. Thus, the effect of assigning to df_b is unclear: does it also affect df_a? The problem can be solved by explicitly making a copy when filtering:

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, definitely a copy now
df_b = df_a[df_a['col1'] > 1].copy()

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# No Warning now

I think there is something that I am missing: if we can never really be sure whether we create a view or not, what are views good for? From the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=view#indexing-view-versus-copy):

Outside of simple cases, it’s very hard to predict whether it [getitem] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

Similar warnings can be found for different indexing methods.

I find it very cumbersome and error-prone to sprinkle .copy() calls throughout my code. Am I using the wrong style for manipulating my DataFrames? Or is the performance gain so high that it justifies the apparent awkwardness?

Recommended answer

Great question!

The short answer is: this is a flaw in pandas that's being remedied.

You can find a longer discussion of the nature of the problem here, but the main takeaway is that we're now moving to a "copy-on-write" behavior in which any time you slice, you get what behaves as a new copy, and you never have to think about views. The fix is expected to come through this refactoring project. I actually tried to fix it directly (see here), but it just wasn't feasible in the current architecture.

In truth, we'll keep views in the background, because they make pandas very memory efficient and fast when they can be provided, but we'll end up hiding them from users. From the user's perspective, if you slice, index, or cut a DataFrame, what you get back will effectively be a new copy.

(This is accomplished by creating views when the user is only reading data, but whenever an assignment operation is used, the view will be converted to a copy before the assignment takes place.)
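
The read-as-view, write-as-copy idea can be illustrated with a toy sketch. The LazyCopy class below is hypothetical and uses plain Python lists; it is not how pandas implements copy-on-write internally, only a minimal model of the behavior described above.

```python
class LazyCopy:
    """Toy copy-on-write wrapper (hypothetical; not pandas internals)."""

    def __init__(self, data):
        self._data = data          # shared with the parent until the first write
        self._owns_data = False    # becomes True once a private copy exists

    def __getitem__(self, i):
        # Reads go straight to the shared buffer: no copy is made.
        return self._data[i]

    def __setitem__(self, i, value):
        # The first write makes a private copy, so the parent is never modified.
        if not self._owns_data:
            self._data = list(self._data)
            self._owns_data = True
        self._data[i] = value


parent = [1, 2, 3, 4]
child = LazyCopy(parent)

shares_before_write = child._data is parent   # still a "view" while only reading
child[0] = 99                                 # triggers the copy, then assigns
shares_after_write = child._data is parent    # now backed by its own copy
```

Reading through child costs nothing extra, while the write leaves parent untouched, which is the user-facing guarantee the answer describes: the result of slicing behaves like a copy even though no copy is made until it is needed.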

Best guess is that the fix will be in within a year. In the meantime, I'm afraid some .copy() calls may be necessary, sorry!
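
One way to keep explicit .copy() calls to a minimum in the meantime, offered as a sketch rather than as something the answer itself recommends, is to build derived columns with DataFrame.assign. It returns a new DataFrame instead of writing into the filtered result, so the ambiguous in-place assignment never happens. The frame and column names follow the question's example.

```python
import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filter, then derive the new column on a fresh DataFrame returned by assign().
# Nothing is written into the filtered result in place.
df_b = df_a[df_a['col1'] > 1].assign(new_col=lambda d: 2 * d['col1'])

print(df_b)
```

Because assign works on its own copy of the frame, df_a keeps its single column. Whether this style suits a given code base is a judgment call; an explicit .copy() after filtering remains equally valid.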
