Pandas functions too slow - optimise with dict/numpy?
Question
I have ~10 large df's (5mil+ rows each and growing) that I want to perform calculations on. Doing so with raw pandas, even on a super-fast AWS machine, is unbearably slow. Most functions I need are basic, so I'm thinking it would be possible to export from pandas to a dict(?), run my calculations, then send the result back to a df?
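A minimal sketch of the dict round trip being asked about (my illustration, not code from the question; `df` stands for the trades frame shown below):

import pandas as pd

# DataFrame -> list of plain dicts, one per row
records = df.to_dict("records")
# ... run plain-Python calculations over `records` here ...
# list of dicts -> DataFrame again, keeping the original time index
result = pd.DataFrame(records, index=df.index)

Note that looping over dicts in Python is usually slower than staying inside pandas, which is the route the answer below takes instead.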
The original df is simply a price and size capture of trades as per below (millions of lines as mentioned).
                             size        price
time
2018-05-18 12:05:11.521 -0.026600  8100.000000
2018-05-18 12:05:11.674 -0.115616  8100.000000
2018-05-18 12:05:11.677 -0.026611  8100.000000
2018-05-18 12:05:11.678 -0.074000  8098.400000
2018-05-18 12:05:11.680 -0.783772  8096.600000
2018-05-18 12:05:11.807 -1.000000  8096.600000
2018-05-18 12:05:12.024 -0.100600  8096.600000
2018-05-18 12:05:12.198 -0.899400  8096.600000
2018-05-18 12:05:12.199 -1.600600  8095.100000
2018-05-18 12:05:14.949  1.000000  8092.600000
2018-05-18 12:05:14.951  0.258350  8092.600000
2018-05-18 12:05:30.191 -0.017330  8092.500000
2018-05-18 12:05:30.192 -0.161670  8088.300000
2018-05-18 12:05:30.712 -0.002000  8088.300000
2018-05-18 12:05:30.773 -0.002000  8088.300000
2018-05-18 12:05:34.688  0.003328  8088.400000
Now I want to apply the following (which acts as an aggregation of the millions of rows into 5-second windows):
df = df.groupby(pd.Grouper(freq='5S')).apply(tick_features).shift()[1:]
where tick_features() is:
import numpy as np
import pandas as pd

def tick_features(x):
    # x is the DataFrame of trades falling into one 5-second window
    if not x.empty:
        open = x['price'].iloc[0]
        close = x['price'].iloc[-1]
    else:
        open = np.nan
        close = np.nan
    high = x['price'].max()
    low = x['price'].min()
    volume = np.abs(x['size']).sum()
    buy_volume = x['size'][x['size'] > 0].sum()
    sell_volume = np.abs(x['size'][x['size'] < 0].sum())
    pct_buy_volume = buy_volume / (buy_volume + sell_volume)
    pct_sell_volume = sell_volume / (buy_volume + sell_volume)
    num_trades = x['size'].count()
    num_buy_trades = (x['size'] > 0).sum()
    num_sell_trades = (x['size'] < 0).sum()
    pct_buy_trades = (x['size'] > 0).mean() * 100
    pct_sell_trades = (x['size'] < 0).mean() * 100
    return pd.Series([open, high, low, close, volume, buy_volume, sell_volume,
                      pct_buy_volume, pct_sell_volume, num_trades, num_buy_trades,
                      num_sell_trades, pct_buy_trades, pct_sell_trades],
                     index=['open', 'high', 'low', 'close', 'volume', 'buy_volume',
                            'sell_volume', 'pct_buy_volume', 'pct_sell_volume',
                            'num_trades', 'num_buy_trades', 'num_sell_trades',
                            'pct_buy_trades', 'pct_sell_trades'])
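As a quick sanity check (illustrative only, not from the question), calling tick_features on a two-row sample shows the Series it produces for a single window:

sample = pd.DataFrame(
    {"size": [-0.0266, 1.0], "price": [8100.0, 8092.6]},
    index=pd.to_datetime(["2018-05-18 12:05:11.521",
                          "2018-05-18 12:05:14.949"]),
)
print(tick_features(sample))
# open=8100.0, close=8092.6, volume=1.0266,
# buy_volume=1.0, sell_volume=0.0266, num_trades=2, ...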
This type of optimization is out of my league, so any explanation of whether this is even feasible to begin with is much appreciated.
Answer
The code is slow because there are so many groups: for every group, Pandas needs to create a DataFrame object and pass it to tick_features(), and that loop runs in Python.
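A rough way to see that overhead (my own sketch with made-up sizes; absolute timings will differ on your machine) is to time .apply() against the equivalent built-in aggregation on the same groups:

import numpy as np
import pandas as pd
import timeit

idx = pd.date_range("2018-05-18", periods=50_000, freq="0.1S")
demo = pd.DataFrame({"size": np.random.randn(len(idx))}, index=idx)
g = demo.groupby(pd.Grouper(freq="5S"))

# Python-level loop: a DataFrame is built and handed to the lambda per group
t_apply = timeit.timeit(lambda: g.apply(lambda x: x["size"].sum()), number=3)
# Cython loop: no per-group DataFrame ever reaches Python code
t_builtin = timeit.timeit(lambda: g["size"].sum(), number=3)
print(f"apply: {t_apply:.3f}s   builtin: {t_builtin:.3f}s")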
To speed up the calculation, you can instead call aggregation methods that execute in Cython loops.
Prepare some dummy data first:
import pandas as pd
import numpy as np
idx = pd.date_range("2018-05-01", "2018-06-02", freq="0.1S")
x = np.random.randn(idx.shape[0], 2)
df = pd.DataFrame(x, index=idx, columns=["size", "price"])
Add extra columns to it; the calculation is fast if you have enough memory:
df["time"] = df.index
df["volume"] = df["size"].abs()
df["buy_volume"] = np.clip(df["size"], 0, np.inf)
df["sell_volume"] = np.clip(df["size"], -np.inf, 0)
df["buy_trade"] = df["size"] > 0
df["sell_trade"] = df["size"] < 0
Then group the DataFrame and call the aggregation methods:
g = df.groupby(pd.Grouper(freq="5s"))
df2 = pd.DataFrame(
    dict(
        open=g["price"].first(),
        close=g["price"].last(),
        high=g["price"].max(),
        low=g["price"].min(),
        volume=g["volume"].sum(),
        buy_volume=g["buy_volume"].sum(),
        sell_volume=-g["sell_volume"].sum(),  # stored as negatives, so flip the sign
        num_trades=g["size"].count(),
        num_buy_trades=g["buy_trade"].sum(),
        num_sell_trades=g["sell_trade"].sum(),
        pct_buy_trades=g["buy_trade"].mean() * 100,
        pct_sell_trades=g["sell_trade"].mean() * 100,
    )
)
d = df2.eval("buy_volume + sell_volume")
df2["pct_buy_volume"] = df2.eval("buy_volume / @d")
df2["pct_sell_volume"] = df2.eval("sell_volume / @d")