Python re-sampling time series data which can not be indexed

Problem description

The purpose of this question is to know how many trades "happened" in each second (count) as well as the total volume traded (sum).

I have time series data which cannot be indexed (as there are multiple entries with the same time-stamp - there can be many trades in the same millisecond), and therefore the use of resample as explained here cannot work.

Another approach was to first group by time as shown here (and later resample per second). The problem is that grouping applies only one aggregation over the grouped items (I can only sum/mean/std etc.), while in this data I would need the 'tradeVolume' column to be aggregated by sum and the 'ask1' column to be aggregated by mean.

So my questions are: 1. how to group by with a different aggregation per column, and 2. if that is not possible, is there any other way to resample the millisecond data into seconds without a datetime index?

Thanks!

The time series (sample) is here:

SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:09.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:09.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0

Recommended answer

First you need to have a column for the seconds (since epoch), then groupby using that column, and then do an aggregation on the columns you want.

You want to floor the timestamp down to one-second accuracy and group using that. Then apply an aggregation to get the mean/sum/std or whatever you need:

import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
# Truncating the dtype to whole seconds floors every timestamp to its second
df['dateTime'] = df['dateTime'].astype('datetime64[s]')
groups = df.groupby('dateTime')
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
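
For completeness, here is a minimal sketch (an illustration, not part of the original answer) of the seconds-since-epoch variant described above; grouping on an integer 'seconds' column like this is one way to produce the epoch-second index shown in the output further below:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
# Parse the timestamps, then integer-divide nanoseconds down to whole seconds since epoch
df['dateTime'] = pd.to_datetime(df['dateTime'])
df['seconds'] = df['dateTime'].astype('int64') // 10**9
groups = df.groupby('seconds')
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})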

I modified the data to make sure there are actually different seconds in it:

SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:10.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:10.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0

And the output:

In [53]: groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
Out[53]: 
               ask1  tradeVolume
seconds                         
1511164809  12869.0           10
1511164810  12869.0           10


Footnote

OP said that the original version (below) was faster, so I ran some timings:

import numpy as np
import pandas as pd

def test1(df):
    """This is the fastest and cleanest."""
    df['dateTime'] = df['dateTime'].astype('datetime64[s]')
    groups = df.groupby('dateTime')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

def test2(df):
    """Totally unnecessary amount of datetime floors."""
    def group_by_second(index_loc):
        return df.loc[index_loc, 'dateTime'].floor('S')
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    groups = df.groupby(group_by_second)
    result = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

def test3(df):
    """Original version, but the conversion to/from nanoseconds is unnecessary."""
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    df['seconds'] = df['dateTime'].apply(lambda v: v.value // 1e9)
    # Group on the derived epoch-seconds column
    groups = df.groupby('seconds')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

if __name__ == '__main__':
    import timeit
    print('22 rows')
    df = pd.read_csv('data_small.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

    print('220 rows')
    df = pd.read_csv('data.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

I tested those on two datasets, the second one 10 times the size of the first. The results:

22 rows
test1 [0.08138518501073122, 0.07786444900557399, 0.0775048139039427]
test2 [0.2644687460269779, 0.26298125297762454, 0.2618108610622585]
test3 [0.10624988097697496, 0.1028324980288744, 0.10304366517812014]
220 rows
test1 [0.07999306707642972, 0.07842653687112033, 0.07848454895429313]
test2 [1.9794962559826672, 1.966513831866905, 1.9625889619346708]
test3 [0.12691736104898155, 0.12642419710755348, 0.126510804053396]

So, it is best to use the .astype('datetime64[s]') version as that is the fastest and scales the best.
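
As a closing usage note, the question also asks for the number of trades per second (count) alongside the total volume (sum); the same groupby/agg pattern accepts a list of functions per column. Below is a minimal sketch under the assumption that rows with isTrade == 1 mark actual trades (that reading of the flag is an assumption, not stated in the answer):

import pandas as pd

df = pd.read_csv('data.csv')
# Assumption: isTrade == 1 marks an actual trade row; keep only those
trades = df[df['isTrade'] == 1].copy()
# Floor timestamps to whole seconds, then count trades, sum volume, and average ask1 per second
trades['dateTime'] = pd.to_datetime(trades['dateTime']).dt.floor('s')
summary = trades.groupby('dateTime').agg({'tradeVolume': ['count', 'sum'], 'ask1': 'mean'})
print(summary)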
