Rolling operation slow performance to create a new column


Problem description

  • The original database is similar to this (although much larger):

import numpy as np
import pandas as pd

idx = [np.array(['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar','Apr', 'Apr', 'May', 'Jun', 'Jun', 'Jun','Jul', 'Aug', 'Aug', 'Sep', 'Sep', 'Oct','Oct', 'Oct', 'Nov', 'Dic', 'Dic',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names=['date','type']

It looks like this:

           x
date type
Jan  A     1
     B     5
Feb  B     3
Mar  A     2
     B     7
     C     3
Apr  A     1
     B     6
May  B     3
Jun  A     5
     B     2
     C     3
Jul  A     1
Aug  B     9
     C     3
Sep  A     2
     B     7
Oct  C     3
     A     6
     B     8
Nov  A     2
Dic  B     7
     C     9

  • My goal is to improve the following code, which creates a new column in the dataframe (a rolling moving average with different weights). My code is:

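      df=df.reset_index()
      df['rolling']=0
      # Process each type separately so a window never mixes types
      for j in df['type'].unique():
          list_1=list(df['x'][df['type']==j])          # x values of this type, in row order
          cumsum = [0]
          list_2=list(df['x'][df['type']==j].index)    # row labels of those values
          z=[]
          for i, h in enumerate(list_1, 1):
              if i>=4:
                  # Weighted average of the three values preceding the current one
                  cumsum.append(0.2*list_1[i-4]+0.3*list_1[i-3]+0.5*list_1[i-2])
              else:
                  # Fewer than three preceding values: placeholder string, not a float NaN
                  cumsum.append('NaN')
              cumsum.pop(0)
              z.append(cumsum[0])
          df['rolling'][list_2]=z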
      

    • It looks like this:

         date type  x rolling
      0   Jan    A  1     NaN
      1   Jan    B  5     NaN
      2   Feb    B  3     NaN
      3   Mar    A  2     NaN
      4   Mar    B  7     NaN
      5   Mar    C  3     NaN
      6   Apr    A  1     NaN
      7   Apr    B  6     5.4
      8   May    B  3     5.7
      9   Jun    A  5     1.3
      10  Jun    B  2     4.7
      11  Jun    C  3     NaN
      12  Jul    A  1     3.2
      13  Aug    B  9     3.1
      14  Aug    C  3     NaN
      15  Sep    A  2     2.2
      16  Sep    B  7     5.7
      17  Oct    C  3       3
      18  Oct    A  6     2.3
      19  Oct    B  8     6.6
      20  Nov    A  2     3.8
      21  Dic    B  7     7.9
      22  Dic    C  9       3
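
To make the weighting concrete, here is a minimal sketch that recomputes one value of the table above by hand. It assumes, as the loop implies, that each `rolling` value combines the three preceding `x` values of the same type with weights 0.2, 0.3 and 0.5 (oldest to newest). The variable names are illustrative only.

```python
# Weights applied to the three preceding values, oldest first
weights = [0.2, 0.3, 0.5]

# First four x values of type B in row order (Jan, Feb, Mar, Apr)
b_values = [5, 3, 7, 6]

# The 'rolling' value for the 4th B row (Apr) uses only the three rows before it
rolling_for_apr_b = sum(w * v for w, v in zip(weights, b_values[:3]))

print(round(rolling_for_apr_b, 1))  # 5.4, matching row 7 of the table
```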
      

      If your code performs better than mine, it would be interesting to know how much faster it is. If you think your code is better but don't know how much faster it is, post it anyway, because I will find out by running it on a larger dataframe. Thanks!

      Recommended answer

      Let's try this and see whether it speeds up your code:

      import numpy as np
      import pandas as pd

      idx = [np.array(['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar','Apr', 'Apr', 'May', 'Jun', 'Jun', 'Jun','Jul', 'Aug', 'Aug', 'Sep', 'Sep', 'Oct','Oct', 'Oct', 'Nov', 'Dic', 'Dic',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
      data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
      df = pd.DataFrame(data, index=idx, columns=['x'])
      df.index.names=['date','type']
      
      # Per type, weight the three values before the current row (window of 4, last one unused);
      # then drop the extra index level and restore the (date, type) order so the result aligns with df
      df['rolling'] = df.groupby('type')['x'].rolling(4).apply(lambda x: x[-4]*.2 + x[-3]*.3 + x[-2]*.5, raw=True)\
                        .reset_index(level=2, drop=True).swaplevel(0,1)
      
      df
      

      Output:

                 x  rolling
      date type            
      Jan  A     1      NaN
           B     5      NaN
      Feb  B     3      NaN
      Mar  A     2      NaN
           B     7      NaN
           C     3      NaN
      Apr  A     1      NaN
           B     6      5.4
      May  B     3      5.7
      Jun  A     5      1.3
           B     2      4.7
           C     3      NaN
      Jul  A     1      3.2
      Aug  B     9      3.1
           C     3      NaN
      Sep  A     2      2.2
           B     7      5.7
      Oct  C     3      3.0
           A     6      2.3
           B     8      6.6
      Nov  A     2      3.8
      Dic  B     7      7.9
           C     9      3.0
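
As a possible further step, the same column can probably be computed without calling a Python function per window, by combining shifted copies of `x` within each group. This is a sketch of an alternative that is not part of the original answer; the column name `rolling_shift` is made up for illustration, and the approach assumes the window always consists of the three preceding rows of the same type.

```python
import numpy as np
import pandas as pd

idx = [np.array(['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'May',
                 'Jun', 'Jun', 'Jun', 'Jul', 'Aug', 'Aug', 'Sep', 'Sep', 'Oct',
                 'Oct', 'Oct', 'Nov', 'Dic', 'Dic']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C',
                 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
x = [1, 5, 3, 2, 7, 3, 1, 6, 3, 5, 2, 3, 1, 9, 3, 2, 7, 3, 6, 8, 2, 7, 9]
df = pd.DataFrame({'x': x}, index=idx)
df.index.names = ['date', 'type']

# Group by the 'type' index level; shift() keeps the original index,
# so no reset_index/swaplevel step is needed afterwards.
g = df.groupby(level='type')['x']

# shift(3) is the oldest of the three preceding values, shift(1) the most recent.
df['rolling_shift'] = 0.2 * g.shift(3) + 0.3 * g.shift(2) + 0.5 * g.shift(1)

print(df)
```

Rows with fewer than three preceding values of the same type come out as NaN, because the shifted series are NaN there.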
      

      Timings...

      Your code:

      324 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      This code:

      12.6 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
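
Since the question mentions a much larger dataframe, the comparison can be repeated on synthetic data. The sketch below is only a rough harness: the row count, the random data, and the helper names `with_apply` and `with_shift` are assumptions made for illustration, and `with_shift` is a shift-based formulation that does not appear in the original answer. Actual timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger dataset: random types and values, plain integer index
rng = np.random.default_rng(0)
n = 30_000
big = pd.DataFrame({
    'type': rng.choice(['A', 'B', 'C'], size=n),
    'x': rng.integers(1, 10, size=n),
})

def with_apply(frame):
    """rolling(4).apply per type, as in the answer (w[0] is the oldest value)."""
    out = (frame.groupby('type')['x']
                .rolling(4)
                .apply(lambda w: w[0] * .2 + w[1] * .3 + w[2] * .5, raw=True))
    # Drop the prepended 'type' level and restore the original row order
    return out.reset_index(level=0, drop=True).sort_index()

def with_shift(frame):
    """Same weights, built from shifted copies of x within each type."""
    g = frame.groupby('type')['x']
    return 0.2 * g.shift(3) + 0.3 * g.shift(2) + 0.5 * g.shift(1)

# Average over a few runs of each variant
runs = 3
t_apply = timeit.timeit(lambda: with_apply(big), number=runs) / runs
t_shift = timeit.timeit(lambda: with_shift(big), number=runs) / runs
print(f"rolling.apply: {t_apply * 1e3:.1f} ms per run")
print(f"shift-based:   {t_shift * 1e3:.1f} ms per run")
```

In a notebook, `%timeit with_apply(big)` would give output in the same format as the figures above.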
