Faster alternatives to Pandas pivot_table
Question
I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). Execution time is paramount, so I am trying to speed up the process. Currently it takes around 8 seconds to process the whole dataset, which is way too slow, and I hope to find alternatives that improve speed/performance.
My current Pandas pivot_table:
df_pivot = df_original.pivot_table(index="industry", columns="months",
                                   values=["orders", "client_name"],
                                   aggfunc={"orders": np.sum, "client_name": pd.Series.nunique})
df_original includes all the data (10 million rows, imported from a CSV). industry is the client's industry, months are the order months (Jan to Dec), and orders is the number of orders. All columns were converted to categorical data, except orders (int datatype). Originally industry, months and client_name were strings.
I tried using pandas.DataFrame.unstack, which was even slower. I also experimented with Dask. The Dask pivot_table yielded some improvement (6 seconds execution time, so 2 seconds less). However, it is still pretty slow.
Are there any faster alternatives (for large datasets)? Maybe recreating the pivot table with groupby, crosstab, ...? Unfortunately, I did not get the alternatives to work at all, and I am still quite new to Python and Pandas.
Looking forward to your suggestions. Thanks in advance!
Update:
I figured out the groupby way with:
df_new = df_original.groupby(["months", "industry"]).agg({"orders": np.sum, "client_name": pd.Series.nunique}).unstack(level="months").fillna(0)
This is much faster now, at about 2-3 seconds. Are there any options to improve speed further?
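One way to confirm that the groupby/unstack route produces the same table as pivot_table is to run both on a small frame and compare. The following is only a sketch: the toy data is hypothetical, and the string aggregator names "sum" and "nunique" are used in place of np.sum and pd.Series.nunique.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df_original.
df = pd.DataFrame({
    "industry": ["retail", "retail", "tech", "tech", "tech"],
    "months": ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "client_name": ["a", "b", "c", "c", "d"],
    "orders": [1, 2, 3, 4, 5],
})

aggs = {"orders": "sum", "client_name": "nunique"}

# Original approach: pivot_table.
via_pivot = df.pivot_table(
    index="industry", columns="months",
    values=["orders", "client_name"], aggfunc=aggs,
).fillna(0)

# Approach from the update: groupby, aggregate, then unstack the months level.
via_groupby = (
    df.groupby(["months", "industry"])
      .agg(aggs)
      .unstack(level="months")
      .fillna(0)
)

# Align row and column order and dtypes before comparing the two results.
left = via_pivot.sort_index(axis=1).sort_index().astype(float)
right = via_groupby.sort_index(axis=1).sort_index().astype(float)
same = left.equals(right)
```

If same is True on a representative sample, the faster groupby version can replace the pivot_table call.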
Recommended answer
Convert the months and industry columns to categorical columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html This way you avoid a lot of string comparisons.
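The conversion could look like the following sketch. The toy data is hypothetical and merely stands in for df_original. Passing observed=True to groupby is an addition not mentioned in the answer: with categorical keys it restricts the result to the category combinations that actually occur in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_original (the real frame has 10m rows).
rng = np.random.default_rng(0)
n = 1_000
df_original = pd.DataFrame({
    "industry": rng.choice(["retail", "energy", "tech"], size=n),
    "months": rng.choice(["Jan", "Feb", "Mar"], size=n),
    "client_name": rng.choice([f"client_{i}" for i in range(50)], size=n),
    "orders": rng.integers(1, 10, size=n),
})

# Convert the string columns used as grouping keys to categoricals,
# so grouping works on integer codes rather than on strings.
for col in ["industry", "months"]:
    df_original[col] = df_original[col].astype("category")

# observed=True (an assumption, not from the answer) keeps only the
# category combinations that are present in the data.
df_new = (
    df_original
    .groupby(["months", "industry"], observed=True)
    .agg({"orders": "sum", "client_name": "nunique"})
    .unstack(level="months")
    .fillna(0)
)
```

Whether this helps on the full dataset depends on the data, so it is worth timing both variants on the real frame.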