Faster alternatives to Pandas pivot_table


Question

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I'm trying to speed up the process. Currently it takes around 8 secs to process the whole dataset, which is way too slow, and I hope to find alternatives to improve speed/performance.

My current Pandas pivot_table:

import numpy as np
import pandas as pd

df_pivot = df_original.pivot_table(
    index="industry",
    columns="months",
    values=["orders", "client_name"],
    aggfunc={"orders": np.sum, "client_name": pd.Series.nunique},
)

df_original includes all the data (10m rows, imported from a CSV). Industry is the client's industry, months are the order months (Jan to Dec), and orders is the number of orders. All data was converted to categorical data, except the number of orders (int dtype). Originally industry, months and client_name were strings.

I tried using pandas.DataFrame.unstack - which was even slower. I also experimented with Dask. The Dask pivot_table yielded some improvement (6 sec execution time - so 2 sec less). However, it is still pretty slow. Are there any faster alternatives (for large datasets)? Maybe a recreation of the pivot table with groupby, crosstab, ...? Unfortunately, I did not get the alternatives to work at all, and I am still quite new to Python and Pandas... Looking forward to your suggestions. Thanks in advance!
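For reference, the crosstab route mentioned above can reproduce at least the summed-orders half of the pivot. This is a minimal sketch on a small made-up frame standing in for df_original (the column names match the question; the data is hypothetical), not a benchmark of the 10m-row case:

```python
import pandas as pd

# Tiny stand-in for df_original (hypothetical data).
df = pd.DataFrame({
    "industry": ["retail", "retail", "tech", "tech"],
    "months": ["Jan", "Feb", "Jan", "Jan"],
    "orders": [3, 5, 2, 4],
    "client_name": ["a", "b", "c", "c"],
})

# crosstab builds the industry x months table of summed orders directly;
# missing combinations come back as NaN, hence the fillna(0).
orders = pd.crosstab(
    df["industry"], df["months"],
    values=df["orders"], aggfunc="sum",
).fillna(0)
```

The unique-client count would need a second crosstab (or a groupby) with a nunique aggregation, since one crosstab call handles a single values/aggfunc pair.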

Update:

I figured out the groupby way with:

df_new = (df_original
          .groupby(["months", "industry"])
          .agg({"orders": np.sum, "client_name": pd.Series.nunique})
          .unstack(level="months")
          .fillna(0))

This is much faster now, at about 2-3 secs. Are there still some options to improve speed further?

Answer

Convert the columns months and industry to categorical columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html This way you avoid a lot of string comparisons.
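The conversion itself is a one-liner per column via astype("category"). A minimal sketch on a small hypothetical frame (the column names mirror the question):

```python
import pandas as pd

# Hypothetical stand-in for the string columns of df_original.
df = pd.DataFrame({
    "industry": ["retail", "tech", "retail"],
    "months": ["Jan", "Jan", "Feb"],
})

# Convert the object (string) columns to pandas categoricals.
# Each distinct string is stored once; rows hold small integer codes,
# so grouping compares integers instead of strings.
for col in ["industry", "months"]:
    df[col] = df[col].astype("category")
```

With only 12 months and a limited set of industries, the categorical codes are far cheaper to hash and compare than the repeated strings, which is where the pivot/groupby time goes.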
