每日数据,每3天重新采样一次,然后连续5天进行有效计算 [英] daily data, resample every 3 days, calculate over trailing 5 days efficiently

查看:92
本文介绍了每日数据,每3天重新采样一次,然后连续5天进行有效计算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑df

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df

我想每3天计算过去5天的总和.

我希望看起来像这样

此内容已编辑
我的不正确. @ivan_pozdeev和@boud注意到这是一个居中的窗口,这不是我的意图.混淆的应用程序.
每个人的解决方案都体现了我的追求.


条件

  • 我正在寻找可以扩展到大数据集的智能高效解决方案.

  • 我将成为计时解决方案,同时还要考虑优雅.

  • 解决方案还应该推广到各种样本和回溯频率.


来自评论

  • 我想要一种解决方案,该解决方案可以概括为处理指定频率的回溯并抓住该回溯内的所有内容.
    • 对于上面的示例,回溯为5D,并且可能有4或50个观测值属于该回溯.
  • 我希望时间戳记是回溯期内最后观察到的时间戳记.

解决方案

此处列出了两个 三个少数几个基于NumPy的解决方案,这些解决方案使用了基于bin的求和,基本上涵盖了三种情况. /p>

方案1:每个日期有多个条目,但没有丢失的日期

方法1:

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

方法2:

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern    
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

    # Find the start of those shifting indices    
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)    
    IDsums = np.bincount(id_arr,df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

方法3:

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

我们可以将卷积部分替换为直接切片求和,以获取其修改版本-

def vectorized_app3_v2(df, S=3, W=5):  
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

场景2:每个日期和缺少的日期都有多个条目

方法4:

def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])

方案3:连续的日期,每个日期只有一个条目

方法5:

def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
                                        strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])


创建测试数据的建议

场景1:

# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])

场景2:

要为多个日期和缺少日期的日期创建设置,我们只需编辑df_data创建步骤,就像这样-

df_data = np.random.randint(0,9,(len(idx0)))

场景3:

# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

consider the df

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df

I want to calculate the sum over a trailing 5 days, every 3 days.

I expect something that looks like this

this was edited
what I had was incorrect. @ivan_pozdeev and @boud noticed this was a centered window and that was not my intention. Appologies for the confusion.
everyone's solutions capture much of what I was after.


criteria

  • I'm looking for smart efficient solutions that can be scaled to large data sets.

  • I'll be timing solutions and also considering elegance.

  • Solutions should also be generalizable for a variety of sample and look back frequencies.


from comments

  • I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
    • for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
  • I want the timestamp to be the last observed timestamp within the look back period.

解决方案

Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.

Scenario #1 : Multiple entries per date, but no missing dates

Approach #1 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #2 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern    
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

    # Find the start of those shifting indices    
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)    
    IDsums = np.bincount(id_arr,df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #3 :

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

We could replace the convolution part with direct sliced summation for a modified version of it -

def vectorized_app3_v2(df, S=3, W=5):  
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #2 : Multiple entries per date and missing dates

Approach #4 :

def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #3 : Consecutive dates and exactly one entry per date

Approach #5 :

def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
                                        strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])


Suggestions for creating test-data

Scenario #1 :

# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])

Scenario #2 :

To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -

df_data = np.random.randint(0,9,(len(idx0)))

Scenario #3 :

# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

这篇关于每日数据,每3天重新采样一次,然后连续5天进行有效计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆