Progress indicator during pandas operations


Question

I regularly perform pandas operations on data frames of 15 million rows or more, and I'd love to have access to a progress indicator for particular operations.

Does a text-based progress indicator for pandas split-apply-combine operations exist?

For example, in something like:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

where feature_rollup is a somewhat involved function that takes many DataFrame columns and creates new user columns through various methods. These operations can take a while on large data frames, so I'd like to know whether it is possible to have text-based output in an IPython notebook that updates me on the progress.

So far, I've tried the canonical loop progress indicators for Python, but they don't interact with pandas in any meaningful way.

I'm hoping there's something I've overlooked in the pandas library or documentation that lets one know the progress of a split-apply-combine. A simple implementation might look at the total number of data frame subsets the apply function is working on and report progress as the completed fraction of those subsets.

Is this perhaps something that needs to be added to the library?
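
The simple implementation described in the question (count the groups, then report the completed fraction) could be sketched roughly as below. The helper name `iter_groups_with_progress` and the sample frame are hypothetical, made up for illustration. The loop walks the groups by hand instead of hooking into `apply`, so treat it as a stand-in for the idea, not as something pandas provides.

```python
import sys

import pandas as pd


def iter_groups_with_progress(grouped, out=sys.stderr):
    """Yield (key, group) pairs and report the fraction of groups finished.

    Hypothetical helper for illustration; not part of pandas.
    """
    total = grouped.ngroups  # number of subsets the groupby produced
    for done, (key, group) in enumerate(grouped, start=1):
        yield key, group
        # Runs once the caller is done with this group and asks for the next.
        out.write(f"\r{done}/{total} groups ({done / total:.0%})")
        out.flush()
    out.write("\n")


# Illustrative data only.
df = pd.DataFrame({"user": ["a", "a", "b", "c"], "value": [1, 2, 3, 4]})

# Per-group work goes where the sum is; results are collected by group key.
totals = pd.Series(
    {
        key: group["value"].sum()
        for key, group in iter_groups_with_progress(df.groupby("user"))
    }
)
```

Progress is written to stderr by default so that it does not mix with regular output; pass any object with `write` and `flush` as `out` to redirect it.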

Recommended answer

Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down. Here's an example for DataFrameGroupBy.progress_apply:

import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)

In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
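
As a small illustration of one of the other supported functions, the sketch below uses `progress_map` on a Series. The `desc` label and the sample data are assumptions chosen for the example; keyword arguments given to `tqdm.pandas()` are passed on to the progress bar.

```python
import pandas as pd
from tqdm import tqdm

# Register the progress_* methods; desc sets the label shown next to the bar.
tqdm.pandas(desc="squaring")

# Illustrative data only.
s = pd.Series(range(5))

# Same result as s.map(...), with a progress bar drawn while it runs.
squared = s.progress_map(lambda v: v * v)
```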

Edit

To directly answer the original question, replace:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

with:

from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)

Note for tqdm <= v4.8: in those older versions, instead of tqdm.pandas() you had to do:

from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())
