Progress indicator during pandas operations (python)


Problem description

I regularly perform pandas operations on data frames in excess of 15 million or so rows and I'd love to have access to a progress indicator for particular operations.

Does a text-based progress indicator for pandas split-apply-combine operations exist?

For example, in something like:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

where feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames, so I'd like to know if it is possible to have text-based output in an IPython notebook that updates me on the progress.

So far, I've tried canonical loop progress indicators for Python but they don't interact with pandas in any meaningful way.

I'm hoping there's something I've overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
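The simple implementation described above could be sketched roughly as follows. This is only an illustration, not a pandas feature: `with_progress`, the toy `df_users` frame, and this `feature_rollup` body are hypothetical stand-ins. It counts the groups with `ngroups`, then wraps the applied function so that each call prints the completed fraction. Older pandas versions may call the function one extra time on the first group, so the reported count is clamped to the total.

```python
import pandas as pd


def with_progress(func, total, report_every=1):
    """Wrap `func` so each call reports the fraction of groups completed."""
    state = {"done": 0}

    def wrapper(group):
        result = func(group)
        state["done"] += 1
        # Clamp in case pandas evaluates the first group more than once.
        done = min(state["done"], total)
        if done % report_every == 0 or done == total:
            print(f"{done}/{total} groups ({done / total:.0%})")
        return result

    wrapper.state = state  # exposed so the call count can be inspected
    return wrapper


# Hypothetical stand-in data and roll-up function, for illustration only.
df_users = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "requestDate": ["d1", "d1", "d1", "d2", "d1"],
    "value": [10, 20, 30, 40, 50],
})


def feature_rollup(group):
    return group["value"].sum()


grouped = df_users.groupby(["userID", "requestDate"])
tracked = with_progress(feature_rollup, total=grouped.ngroups)
result = grouped.apply(tracked)
```

With many groups, a larger `report_every` keeps the printing overhead small.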

Is this perhaps something that needs to be added to the library?

Solution

Due to popular demand, tqdm has added support for pandas. Unlike the other answers, this will not noticeably slow pandas down -- here's an example for DataFrameGroupBy.progress_apply:

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))

# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)

In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm).
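As a small illustration of the "optional kwargs" mentioned in the comment above, the sketch below passes a `desc` label to `tqdm.pandas()` and then uses `progress_apply` on both a Series and a grouped column. The toy frame and the column names `key` and `value` are made up for the example, and it assumes a tqdm release recent enough to work with the installed pandas.

```python
import pandas as pd
from tqdm import tqdm

# Keyword arguments are forwarded to the underlying tqdm bar.
tqdm.pandas(desc="squaring")

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"key": [0, 0, 1, 1, 2], "value": [1, 2, 3, 4, 5]})

# On a Series the bar advances once per element.
squared = df["value"].progress_apply(lambda v: v ** 2)

# On a groupby the bar advances once per group.
group_sums = df.groupby("key")["value"].progress_apply(lambda s: s.sum())
```

The bar is written to standard error, so it shows up in a terminal or notebook without changing the returned values.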

EDIT


To directly answer the original question, replace:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

with:

from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)
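To try the replacement end to end without the real data, a self-contained version might look like the sketch below. The `df_users` contents, the `bytes` column, and this `feature_rollup` body are invented for the example; the real function from the question is not shown there, so this one simply returns two made-up features per group.

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()

# Hypothetical stand-in for the real 15-million-row frame.
df_users = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "requestDate": ["2015-01-01", "2015-01-01", "2015-01-01",
                    "2015-01-02", "2015-01-02"],
    "bytes": [100, 300, 50, 70, 30],
})


def feature_rollup(group):
    # Illustrative features only: request count and summed bytes per group.
    return pd.Series({
        "n_requests": len(group),
        "total_bytes": group["bytes"].sum(),
    })


features = (
    df_users.groupby(["userID", "requestDate"])
    .progress_apply(feature_rollup)
)
```

Here `features` is indexed by (`userID`, `requestDate`), with one row per group and one column per returned feature.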

Note: tqdm <= v4.8: For versions of tqdm below 4.8, instead of tqdm.pandas() you had to do:

from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())
