Setting values on Pandas DataFrame subset (copy) is slow


Question

import timeit
import pandas as pd
import numpy as np

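df = pd.DataFrame(np.random.rand(10, 10))

# Subset of df selected with a boolean mask (every other row)
dft = df[[True, False] * 5]
# df = dft  # uncomment to drop the reference to the original DataFrame
# Explicit deep copy of the subset
dft2 = dft.copy()

new_data = np.random.rand(5, 10)

# Time 100 full-frame assignments, first on the subset, then on its copy
print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))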

On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).

Edit: Removed speculation about proxy objects.

As c. leather suggests, this is likely because of a different code path when setting values on a copy (dft) versus an original DataFrame (dft2).

Is this correct? Any thoughts or explanations?

Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the speed factor to roughly 2 on my laptop. Any idea why this is the case?

Recommended answer

This is not exactly a new question on SO; there are related posts, and the current docs explain it.

The comments from @c.leather are on the right track. The problem is that dft is derived from df by indexing, so it may be a view rather than an independent copy of df, as explained in the linked articles. pandas cannot know for certain whether it really is a copy and whether the operation is safe, so it runs a lot of checks to ensure that the assignment is safe to perform. Those checks can be avoided by simply making an explicit copy.

This is a pertinent issue, and there is a whole discussion on GitHub. I've seen a lot of suggestions; the one I like most is that the docs should encourage the df[[True, False] * 5].copy() idiom, which one might call the slice & copy idiom.
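To make the slice & copy idiom concrete, here is a minimal sketch that reuses the setup from the question. The names mask, subset, original and assign are only illustrative, and the timing it prints will vary with the machine and the pandas version, so treat it as a way to experiment rather than a benchmark.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
mask = [True, False] * 5  # illustrative name: selects every other row

# Slice & copy: take the subset and immediately turn it into an independent frame.
subset = df[mask].copy()

original = df.copy()  # snapshot of df, used only for the check below
new_data = np.random.rand(5, 10)


def assign():
    # Full-frame assignment on the copied subset, as in the question.
    subset.loc[:, :] = new_data


assign()

# The new values landed in the subset, and the parent frame is untouched.
assert np.allclose(subset.to_numpy(), new_data)
assert df.equals(original)

# Rough timing of 100 assignments; no particular number is implied here.
elapsed = timeit.timeit(assign, number=100)
print(f"100 assignments on the copied subset: {elapsed:.4f} s")
```

Copying at the moment of slicing costs one extra allocation up front, but it makes the intent explicit: later assignments are meant for the subset only, not for df.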

I could not find the exact checks, and in the GitHub issue this performance nuance is mentioned only through a few tweets that some developers posted noting the behavior. Maybe someone more involved in pandas development can add more input.
