Python pandas 重新采样 [英] Python pandas resampling

查看:63
本文介绍了Python pandas 重新采样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据框:

    Timestamp    S_time1   S_time2   End_Time_1   End_time_2   Sign_1   Sign_2
0    2413044       0        0           0            0          x        x
1    2422476       0        0           0            0          x        x
2    2431908       0        0           0            0          x        x
3    2441341       0        0           0            0          x        x
4    2541232   2526631   2528631     2520631      2530631      10       80
5    2560273   2544946   2546496     2546496      2548496      40       80
6    2577224   2564010   2566010     2566010      2568010     null    null
7    2592905   2580959   2582959     2582959      2584959     null    null

桌子就这样继续下去.第一列是以毫秒为单位的时间戳. S_time1End_time_1是出现特定符号(数字)的持续时间.例如,如果我们走第5行,S_time1是2526631,End_time_1是2520631,相应的sign_1是10,这意味着从2526631到2520631,将显示符号10.同样,S_time2End_time_2也是如此. sign_2中的相应值将出现在从S_time2End_time_2的持续时间内.

The table goes on and on like that. The first column is a timestamp which is in milliseconds. S_time1 and End_time_1 are the duration where a particular sign (number) appear. For example, if we take the 5th row, S_time1 is 2526631, End_time_1 is 2520631, and the corresponding sign_1 is 10, which means from 2526631 to 2520631 the sign 10 will be displayed. And the same thing goes to S_time2 and End_time_2. The corresponding values in sign_2 will appear in the duration from S_time2 to End_time_2.

我想在100毫秒的bin时间中对索引列(Timestamp)进行重新采样,并检查符号属于哪个bin时间.例如,每个开始时间和结束时间之间存在2000毫秒的差异.因此,相应的符号号将在大约20个连续的bin时间中重复出现,因为每个bin时间为100毫秒.因此,我只需要有两列:一列带有bin时间,第二列带有符号.如下表所示:(例如,我只是在补充bin时间)

I want to resample the index column (Timestamp) in 100-millisecond bin time and check in which bin times the signs belong. For instance, between each start time and end time there is 2000 milliseconds difference. So the corresponding sign number will appear repeatedly in around 20 consecutive bin times because each bin time is 100 millisecond. So I need to have two columns only: one with the bin times and the second with the signs. Looks like the following table: (I am just making up the bin time just for example)

Bin_time   signs
...100        0
...200        0
...300        10
...400        10
...500        10
...600        10

符号10将在相应的S_time1到End_time_1的持续时间内.然后,下一个符号80在S_time2到End_time_2的持续时间内继续.我不确定这是否可以在熊猫中完成.但是我真的需要大熊猫或其他方法的帮助.

The sign 10 will be for the duration of the corresponding S_time1 to End_time_1. Then the next sign which is 80 continues for the duration of S_time2 to End_time_2. I am not sure if this can be done in pandas or not. But I really need help either in pandas or other methods.

感谢您的帮助和建议.

推荐答案

输入:

print df
  Timestamp  S_time1  S_time2  End_Time_1  End_time_2 Sign_1 Sign_2
0    2413044        0        0           0           0      x      x
1    2422476        0        0           0           0      x      x
2    2431908        0        0           0           0      x      x
3    2441341        0        0           0           0      x      x
4    2541232  2526631  2528631     2520631     2530631     10     80
5    2560273  2544946  2546496     2546496     2548496     40     80
6    2577224  2564010  2566010     2566010     2568010   null   null
7    2592905  2580959  2582959     2582959     2584959   null   null

2种方法:

In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop

In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop

def m(df):
    #resample column Timestamp by 100ms, convert bak to integers 
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms', how='first').reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #felper column i for merging
    df1['i'] = 1
    #print df1

    out = df1.merge(df,on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]

    out1 = out1.rename(columns={'Sign_1':'Bin_time'})
    out2 = out2.rename(columns={'Sign_2':'Bin_time'})

    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    #print df.head(10)

def s(df):
    #resample column Timestamp by 100ms, convert bak to integers 
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms', how='first')
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #print out.head(10)

    #search start end 
    def search(x):
        mask1 = (df.S_time1<=x['Timestamp']) & (df.End_Time_1>=x['Timestamp'])
        #if at least one True return first value of series
        if mask1.any():
                return df.loc[mask1].Sign_1[0]
        #check second start and end time
        else:
                mask2 = (df.S_time2<=x['Timestamp']) & (df.End_time_2>=x['Timestamp'])
                if mask2.any():
                    #if at least one True return first value
                    return df.loc[mask2].Sign_2[0]
                else:
                    #if all False return NaN
                    return np.nan

    out['Bin_time'] = out.apply(search, axis=1)
    #print out.head(10)

这篇关于Python pandas 重新采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆