Why is Pandas so madly fast? How to define such functions?


Question

I compared the performance of Pandas with that of a traditional loop. With the same input and output, Pandas performed the calculation dramatically faster than the loop.
# df_1h has been imported before

import time

import numpy as np
import pandas as pd

n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))

close = df_1h['close']

start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))

start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))

display(df_1h.tail())

Output:

"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
        open_time       open        high        low         ... ignore  rsi_14  sma_14_pandas   sma_14_loop
16593   1.562980e+12    11707.39    11739.90    11606.04    ... 0.0 51.813151   11646.625714    11646.625714
16594   1.562983e+12    11664.32    11712.61    11625.00    ... 0.0 49.952679   11646.834286    11646.834286
16595   1.562987e+12    11632.64    11686.47    11510.00    ... 0.0 47.583619   11643.321429    11643.321429
16596   1.562990e+12    11582.06    11624.04    11500.00    ... 0.0 48.725262   11644.912857    11644.912857
16597   1.562994e+12    11604.96    11660.00    11588.16    ... 0.0 50.797087   11656.723571    11656.723571
5 rows × 15 columns

Pandas is almost 2,500 times faster!

  • Is my code wrong?
  • If my code is correct, why is Pandas so fast?
  • How can I define custom functions that run this fast with Pandas?

      Recommended answer

      Regarding your three questions:

      1. Your code is correct in the sense that it produces the correct result. As a rule, however, explicitly iterating over the rows of a dataframe is not a good idea in terms of performance. Most often the same result can be achieved far more efficiently with pandas methods (as you demonstrated yourself).
      2. Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
      3. Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array, and only then store the result back to the dataframe or series.
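The round trip described in point 3 can be sketched as follows. This is only an illustration under assumptions, not the answerer's code: the column names `close` and `log_return` and the helper `log_returns` are hypothetical and were chosen just for this example.

```python
import numpy as np
import pandas as pd


def log_returns(prices: np.ndarray) -> np.ndarray:
    """Hypothetical custom calculation: log return between consecutive prices."""
    out = np.full(prices.shape, np.nan)          # first element has no predecessor
    out[1:] = np.log(prices[1:] / prices[:-1])   # vectorized, no Python-level loop
    return out


df = pd.DataFrame({"close": [100.0, 110.0, 99.0, 108.9]})

values = df["close"].to_numpy()   # 1. pull out the numpy representation
result = log_returns(values)      # 2. do all the work on the plain array
df["log_return"] = result         # 3. store the result back in one assignment
```

The dataframe is touched only twice, once to read and once to write, so the per-element overhead of pandas indexing never enters the calculation.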

      Why is the loop algorithm so slow?

      In your loop algorithm, mean is calculated more than 16,500 times, each time adding up 14 elements to find the mean. Each iteration also pays Python-level overhead for slicing the series and assigning a single element. Pandas' rolling method uses a more sophisticated approach, heavily reducing the number of arithmetic operations.
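The answer does not say which algorithm `rolling` uses internally, so the following is not pandas' implementation. It is a pure-Python sketch of one common way to reduce the arithmetic for a moving average: keep a running sum and update it as the window slides. The function names `sma_naive` and `sma_running` are made up for this illustration.

```python
import math


def sma_naive(values, n):
    """Re-add every window from scratch: about n additions per output value."""
    out = [math.nan] * len(values)
    for i in range(n - 1, len(values)):
        out[i] = sum(values[i - n + 1 : i + 1]) / n
    return out


def sma_running(values, n):
    """Keep a running sum: one addition and one subtraction per output value."""
    out = [math.nan] * len(values)
    window_sum = 0.0
    for i, v in enumerate(values):
        window_sum += v                   # newest element enters the window
        if i >= n:
            window_sum -= values[i - n]   # oldest element leaves the window
        if i >= n - 1:
            out[i] = window_sum / n
    return out


data = [float(x) for x in range(1, 11)]
naive = sma_naive(data, 3)
running = sma_running(data, 3)
```

Both functions return the same averages, but the second does a constant amount of work per element regardless of the window length.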

      You can match the performance of pandas, and in this test beat it by roughly a factor of 3, if you do the calculation in numpy. This is illustrated in the following example:

      import pandas as pd
      import numpy as np
      import time
      
      data = np.random.uniform(10000,15000,16598)
      df_1h = pd.DataFrame(data, columns=['Close'])
      close = df_1h['Close']
      n = 14
      print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
      
      start = time.time()
      df_1h['SMA_14_pandas'] = close.rolling(14).mean()
      print('pandas: {}'.format(time.time() - start))
      
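      start = time.time()
      df_1h['SMA_14_loop'] = np.nan
      for i in range(n-1, df_1h.shape[0]):
          # .loc writes directly into df_1h; chained indexing (df[col][i] = ...)
          # may fail to update the dataframe in recent pandas versions
          df_1h.loc[i, 'SMA_14_loop'] = close.iloc[i-n+1:i+1].mean()
      print('loop:   {}'.format(time.time() - start))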
      
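      def np_sma(a, n=14):
          # cumulative sum; ret[i] is the sum of a[0..i]
          ret = np.cumsum(a)
          # subtracting the cumsum n steps back leaves the sum of the last n elements
          ret[n:] = ret[n:] - ret[:-n]
          # pad the first n-1 positions with NaN, as rolling(n).mean() does
          return np.append([np.nan]*(n-1), ret[n-1:] / n)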
      
      start = time.time()
      df_1h['SMA_14_np'] = np_sma(close.values)
      print('np:     {}'.format(time.time() - start))
      
      assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
      assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
      

      Output:

      df_1h's Shape 16598 rows x 1 columns
      pandas: 0.0031278133392333984
      loop:   7.605962753295898
      np:     0.0010571479797363281
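The cumulative-sum trick in `np_sma` works because a mean can be built from sums. For a custom window function without such a shortcut, one option (an assumption on my part, not something the answer mentions) is numpy's `sliding_window_view`, available in NumPy 1.20 and later. The function `rolling_range` below is a hypothetical example that computes max minus min over each window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_range(a, n=14):
    """Hypothetical custom window function: max minus min over each window of length n."""
    windows = sliding_window_view(a, n)               # shape (len(a) - n + 1, n), a view
    res = windows.max(axis=1) - windows.min(axis=1)   # one vectorized pass per reduction
    # pad the first n-1 positions with NaN so the output lines up with the input
    return np.concatenate([np.full(n - 1, np.nan), res])


a = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
r = rolling_range(a, n=3)   # windows [1,3,2], [3,2,5], [2,5,4] -> ranges 2, 3, 3
```

As with `np_sma`, the result could be stored back with something like `df_1h['range_14_np'] = rolling_range(close.values)`, where the column name is again only illustrative.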
      
