Why is Pandas so madly fast? How to define such functions?
Question
- I tried comparing the performance of Pandas and a traditional loop. I found that, with the same input and output, Pandas performed the calculation dramatically faster than the loop.
# df_1h has been imported before
import time
import numpy as np
import pandas as pd
n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
close = df_1h['close']
start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))
start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))
display(df_1h.tail())
Output:
"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
open_time open high low ... ignore rsi_14 sma_14_pandas sma_14_loop
16593 1.562980e+12 11707.39 11739.90 11606.04 ... 0.0 51.813151 11646.625714 11646.625714
16594 1.562983e+12 11664.32 11712.61 11625.00 ... 0.0 49.952679 11646.834286 11646.834286
16595 1.562987e+12 11632.64 11686.47 11510.00 ... 0.0 47.583619 11643.321429 11643.321429
16596 1.562990e+12 11582.06 11624.04 11500.00 ... 0.0 48.725262 11644.912857 11644.912857
16597 1.562994e+12 11604.96 11660.00 11588.16 ... 0.0 50.797087 11656.723571 11656.723571
5 rows × 15 columns
- Pandas is almost 2.5k times faster!!!
- Is my code wrong?
- If my code is correct, why is Pandas so fast?
- How can I define custom functions for Pandas that run this fast?
Recommended answer
Regarding your three questions:
- Your code is correct in the sense that it produces the correct result. As a rule, however, explicitly iterating over the rows of a dataframe is not a good idea in terms of performance. Most often the same result can be achieved far more efficiently with pandas methods (as you demonstrated yourself).
- Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
- Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array, and only then store the result back to the dataframe or series.
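The third point above can be sketched as follows. This is a minimal illustration and not part of the original answer: the column name close, the random data, and the helper rolling_range are hypothetical, and sliding_window_view assumes NumPy 1.20 or later.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_range(values, n=14):
    """Max minus min of each n-element window (hypothetical custom function)."""
    windows = sliding_window_view(values, n)   # shape: (len(values) - n + 1, n)
    result = np.full(len(values), np.nan)      # first n-1 positions have no full window
    result[n - 1:] = windows.max(axis=1) - windows.min(axis=1)
    return result

rng = np.random.default_rng(0)
df = pd.DataFrame({"close": rng.uniform(10000, 15000, 1000)})

values = df["close"].to_numpy()           # 1. get the numpy representation
df["range_14"] = rolling_range(values)    # 2. compute on the array, 3. store the result back
```

The function only ever sees a plain array, so no per-row pandas indexing happens inside it; the dataframe is touched once to read and once to write.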
Why is the loop algorithm so slow?
In your loop algorithm, mean is calculated over 16500 times, each time adding up 14 elements to find the mean. Each iteration also pays Python-level overhead for slicing the series and assigning a single element. Pandas' rolling method uses a more sophisticated approach that runs in compiled code and heavily reduces the number of arithmetic operations.
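The reduction in arithmetic can be illustrated with a running sum: instead of re-adding all 14 elements for every window, add the element that enters the window and subtract the one that leaves. The following is a sketch of that general idea in plain Python, not pandas' actual implementation; sma_naive and sma_running are hypothetical names.

```python
def sma_naive(values, n):
    """Re-add all n elements for every window: about n additions per output."""
    return [sum(values[i - n + 1:i + 1]) / n for i in range(n - 1, len(values))]

def sma_running(values, n):
    """Keep a running sum: one addition and one subtraction per output."""
    if len(values) < n:
        return []                          # no complete window
    window_sum = sum(values[:n])           # sum of the first full window
    out = [window_sum / n]
    for i in range(n, len(values)):
        window_sum += values[i] - values[i - n]   # add the newest element, drop the oldest
        out.append(window_sum / n)
    return out

data = [float(x) for x in range(1, 21)]
# both approaches agree up to floating-point rounding
assert all(abs(a - b) < 1e-9 for a, b in zip(sma_naive(data, 14), sma_running(data, 14)))
```

With a window of 14, the running version does roughly 2 additions per output instead of 14, and the gap grows with the window size.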
You can achieve performance similar to pandas (and in fact about 3 times better) if you do the calculations in numpy. This is illustrated in the following example:
import pandas as pd
import numpy as np
import time
data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))
start = time.time()
df_1h['SMA_14_loop'] = np.nan
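for i in range(n-1, df_1h.shape[0]):
    df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop: {}'.format(time.time() - start))
def np_sma(a, n=14):
    # running total: ret[i] = a[0] + ... + a[i]
    ret = np.cumsum(a)
    # subtract the total from n steps earlier, leaving the sum of the last n elements
    ret[n:] = ret[n:] - ret[:-n]
    # pad the first n-1 positions (incomplete windows) with NaN
    return np.append([np.nan]*(n-1), ret[n-1:] / n)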
start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np: {}'.format(time.time() - start))
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
Output:
df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop: 7.605962753295898
np: 0.0010571479797363281
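A single time.time() measurement can be noisy for calls that take only a few milliseconds. As a hedged sketch (not part of the original answer), the standard-library timeit module can repeat each measurement and keep the best run. The data here are random, np_sma is copied from the example above, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np
import pandas as pd

def np_sma(a, n=14):
    ret = np.cumsum(a)
    ret[n:] = ret[n:] - ret[:-n]
    return np.append([np.nan] * (n - 1), ret[n - 1:] / n)

close = pd.Series(np.random.uniform(10000, 15000, 16598))

# best of 5 repeats, 20 calls per repeat, reported as seconds per call
t_pandas = min(timeit.repeat(lambda: close.rolling(14).mean(), number=20, repeat=5)) / 20
t_numpy = min(timeit.repeat(lambda: np_sma(close.to_numpy()), number=20, repeat=5)) / 20
print('pandas: {:.6f} s per call'.format(t_pandas))
print('np:     {:.6f} s per call'.format(t_numpy))
```

Taking the minimum over several repeats filters out runs that were slowed down by other activity on the machine.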