What are the exact downsides of copy=False in DataFrame.merge()?

Question

I am a bit confused about the copy argument of DataFrame.merge() after a co-worker asked me about it.

The docstring of DataFrame.merge() states:

copy : boolean, default True
    If False, do not copy data unnecessarily

The pandas documentation states:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

The docstring kind of implies that copying the data is not necessary and can nearly always be skipped. The documentation, on the other hand, says that copying data can't be avoided in many cases.

My questions are:

  • What are those cases?
  • What are the downsides?

Recommended answer

Disclaimer: I'm not very experienced with pandas and this is the first time I have dug into its source, so I can't guarantee that I'm not missing something in the assessment below.

The relevant bits of code have recently been refactored. I'll discuss the subject in terms of the current stable version 0.20, but I don't suspect functional changes compared to earlier versions.

The investigation starts with the source of merge in pandas/core/reshape/merge.py (formerly pandas/tools/merge.py). Ignoring some doc-aware decorators:

def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
          left_index=False, right_index=False, sort=False,
          suffixes=('_x', '_y'), copy=True, indicator=False):
    op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
                         right_on=right_on, left_index=left_index,
                         right_index=right_index, sort=sort, suffixes=suffixes,
                         copy=copy, indicator=indicator)
    return op.get_result()

Calling merge passes the copy parameter on to the constructor of the class _MergeOperation and then calls its get_result() method. The first few lines of the class, with context:

# TODO: transformations??
# TODO: only copy DataFrames when modification necessary
class _MergeOperation(object):
    [...]

Now that second comment is highly suspicious. Moving on, the copy kwarg is bound to an eponymous instance attribute, which only seems to reappear once within the class:

result_data = concatenate_block_managers(
    [(ldata, lindexers), (rdata, rindexers)],
    axes=[llabels.append(rlabels), join_index],
    concat_axis=0, copy=self.copy)

We can then track down the concatenate_block_managers function in pandas/core/internals.py, which just passes the copy kwarg on to concatenate_join_units.

Inside concatenate_join_units, the only thing that copy does is rebind a copy of concat_values to the same name, and only in the special case of a concatenation where there is really nothing to concatenate.
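To make that branch concrete, here is a minimal sketch of the control flow as described in this answer. It is not the pandas source: the function name concat_values_sketch is hypothetical, and plain Python lists stand in for the underlying arrays, so only the copy-or-pass-through logic is illustrated.

```python
def concat_values_sketch(to_concat, copy):
    """Simplified stand-in for the branch described above (not pandas source)."""
    if len(to_concat) == 1:
        # Only one chunk: nothing to concatenate, so the input could be
        # handed back unchanged.
        concat_values = to_concat[0]
        if copy:
            # copy=True rebinds the name to a fresh copy of the values.
            concat_values = list(concat_values)
    else:
        # Several chunks: building the result allocates new storage anyway,
        # so the copy flag makes no difference on this path.
        concat_values = [item for chunk in to_concat for item in chunk]
    return concat_values


original = [1, 2, 3]
shared = concat_values_sketch([original], copy=False)
copied = concat_values_sketch([original], copy=True)
joined = concat_values_sketch([original, [4, 5]], copy=False)

shared[0] = 99  # mutating the uncopied result also changes the input
print(original)            # [99, 2, 3]
print(copied)              # [1, 2, 3]
print(shared is original)  # True
print(joined is original)  # False
```

In this sketch the flag only matters on the single-chunk path. As soon as there is real concatenation work, a new object is produced regardless of copy.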

Now, at this point my lack of pandas knowledge starts to show, because I'm not really sure what exactly is going on this deep inside the call stack. But the above hot-potato scheme, with the copy keyword argument ending up in that no-op-like branch of a concatenation function, is perfectly consistent with the "TODO" comment above, with the documentation quoted in the question:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

and with the related discussion on an old issue:

IIRC I think the copy parameter only matters here if it's a trivial merge and you actually do want it copied (kind of like a reindex with the same index)

Based on these hints I suspect that in the vast majority of real use cases copying is inevitable, and the copy keyword argument has no effect. However, for the small number of exceptions, skipping a copy step might improve performance, and offering that choice costs the majority of use cases nothing. So the option was implemented.

I suspect that the rationale is something like this. The upside of not copying unless necessary (which is only possible in a very few special cases) is that the code avoids some memory allocations and copies in those cases. The downside is that not returning a copy in those few cases might lead to surprises if one doesn't expect that mutating the return value of merge could in any way affect the original dataframe. So the default value of the copy keyword argument is True, and the user only fails to get a copy from merge if they explicitly volunteer for this (and even then they'll still likely end up with a copy).
