Modifying timestamps in pandas to make index unique


Question


I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. This is an example of the data - note there are four 2016-08-23 00:00:17.664193 timestamps:

In [167]: ts
Out[167]:
                               last  last_sz      bid      ask
datetime                                                      
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00


In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows all sharing the same timestamp. I'm aiming to solve this by adding 1 extra nanosecond to each duplicate (so in the case of 4 consecutive identical timestamps, I'd add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be converted to:

In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex                                                        
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0


I've struggled to find a good way to do this - my current solution is to make multiple passes, checking for duplicates each time, and adding 1ns to all but the first in a series of identical timestamps. Here's the code:

import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    # flag every duplicate timestamp except the first in each run
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0

    # each pass shifts all flagged rows by 1ns, so a run of n identical
    # timestamps takes n - 1 passes to resolve
    while duplicate_count > 0:
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1

    print('%d passes of duplication loop' % passes)
    return ts


This is obviously quite inefficient - it often requires hundreds of passes, and if I try it on a 2 million row dataframe, I get a MemoryError. Any ideas for a better way to achieve this?

Answer


Here is a faster numpy version (but a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated-timestamp markers while resetting the cumulative sum each time a NaN is encountered:

import numpy as np

# mark ALL occurrences of a duplicated timestamp (keep=False) as float
# and replace 0 with NaN, so that unique rows become reset points
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan

# replace each NaN with minus the number of marked rows since the previous
# NaN, so the cumulative sum below restarts from zero at every unique row
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# the cumulative sum now gives per-row offsets in nanoseconds
result = df.index + np.cumsum(values).astype('timedelta64[ns]')
print(result)

DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)
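
To see how the reset trick produces these offsets, here is a small self-contained trace (my own illustration, not part of the original answer) of the intermediate arrays for the ten sample rows:

import numpy as np

# marker array for the sample index: rows 4-7 and 9-10 are duplicates,
# so duplicated(keep=False) flags them all; zeros become NaN reset points
values = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1], dtype=float)
values[values == 0] = np.nan

missings = np.isnan(values)
cumsum = np.cumsum(~missings)    # [0 0 0 1 2 3 4 4 5 6]
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff         # reset values: [-0, -0, -0, -4]
print(np.cumsum(values))         # [-0. -0. -0. 1. 2. 3. 4. 0. 1. 2.] -> ns offsets

Note that because duplicated(keep=False) marks every occurrence of a repeated timestamp, the first row of each run is shifted by 1ns as well, as the printed index above shows; the question's desired output left the first occurrence unchanged.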


Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the @DYZ groupby/apply approach (which is more readable) is roughly 50 times slower on the dummy data, at 100 loops, best of 3: 5.3 ms per loop.
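
For comparison, the groupby approach can be sketched roughly as follows; this is a minimal reconstruction using cumcount rather than apply, not necessarily DYZ's exact code:

import pandas as pd

# hypothetical sketch: shift each row by its position within its timestamp
# group, so the first occurrence gets +0ns, the second +1ns, and so on
def dedup_index_groupby(df):
    offsets = df.groupby(level=0).cumcount().values
    df.index = df.index + pd.to_timedelta(offsets, unit='ns')
    return df

Unlike the numpy version above, this leaves the first occurrence of each run untouched, which matches the output requested in the question.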


Finally, of course, you have to assign the new index back:

df.index = result
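
Whichever variant you use, a quick sanity check (my addition) confirms that no duplicate timestamps remain:

# the index should now be strictly unique
assert df.index.is_unique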
