Pandas: keeping only first row of data in each 60 second bin


Problem description


What's the best way to keep only the first row of each 60-second bin of data in pandas? That is, for each row kept at time t, I want to delete all subsequent rows that occur up to t + 60 seconds.

I know there's some combination of groupby().first() that I can probably use, but the code examples I've seen (e.g. using pandas.Grouper(freq='60s')) will discard the original datetimes in favor of every 60 seconds offset from midnight rather than my original datetimes.
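To illustrate the fixed-offset behavior described above, here is a minimal sketch (with made-up data) showing that `pd.Grouper(freq='60s')` snaps bin edges to clock boundaries (13:00:00, 13:01:00, ...) rather than measuring 60 seconds from each kept row:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-11 13:00:10", "2016-05-11 13:00:50", "2016-05-11 13:02:05",
    ]),
    "value": [0.1, 0.2, 0.3],
})

# Bins are aligned to fixed 60 s clock boundaries, so 13:00:10 and 13:00:50
# fall into the same bin even though the desired logic treats 13:00:50 as
# within 60 s of 13:00:10 for a different reason: the bin edge, not the
# first observation, decides membership.
first_per_bin = df.groupby(pd.Grouper(key="time", freq="60s")).first().dropna()
print(first_per_bin)
```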

For example, the following:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
2  2016-05-11 13:02:05.760903860     0.155666
3  2016-05-11 13:02:18.325613076     0.157788
4  2016-05-11 13:02:18.486519052     0.157788
5  2016-05-11 13:02:20.243748548     0.157788
6  2016-05-11 13:02:20.533101692     0.157788
7  2016-05-11 13:02:20.646061652     0.157788
8  2016-05-11 13:02:21.121409820     0.157788
9  2016-05-11 13:04:24.660609068     0.211649
10 2016-05-11 13:04:24.660845612     0.211649
11 2016-05-11 13:04:24.660957596     0.211649
12 2016-05-11 13:04:24.661378132     0.211649
13 2016-05-11 13:04:24.661450628     0.211649
14 2016-05-11 13:04:24.661607044     0.211649

should become this:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
9  2016-05-11 13:04:24.660609068     0.211649

Solution

See Path Dependent Slicing.

Solution

import numpy as np
import pandas as pd

def td60(ta):
    # 60 seconds expressed in nanoseconds (the unit of datetime64[ns])
    d = np.timedelta64(int(6e10))
    tp = ta + d            # cutoff for each row: its time + 60 s
    j = 0                  # index of the most recently kept row
    yield j                # the first row is always kept
    for i, tx in enumerate(ta):
        if tx > tp[j]:     # more than 60 s after the last kept row
            yield i
            j = i

def pir(df):
    # indices of the rows to keep
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])


Example usage

pir(df)
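A self-contained run on data shaped like the OP's example (timestamps condensed to four rows for brevity), with `td60` and `pir` as defined above:

```python
import numpy as np
import pandas as pd

def td60(ta):
    # 60 seconds in nanoseconds (datetime64[ns] units)
    d = np.timedelta64(int(6e10))
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])

times = pd.to_datetime([
    "2016-05-11 13:00:10.841015028",
    "2016-05-11 13:02:05.760595780",   # > 60 s after row 0: kept
    "2016-05-11 13:02:18.325613076",   # < 60 s after row 1: dropped
    "2016-05-11 13:04:24.660609068",   # > 60 s after row 1: kept
])
df = pd.DataFrame({"time": times,
                   "value": [0.215978, 0.155666, 0.157788, 0.211649]})
out = pir(df)
print(out)
```

Note that, as written, `pir` rebuilds the frame from raw values, so the original column labels are not carried over; passing `df.columns` as a third argument to `pd.DataFrame` would preserve them.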


Setup for timing 500,000 rows

pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()

df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))


Timing

Cythonize
In Jupyter

%load_ext Cython


%%cython
import numpy as np
import pandas as pd

def td60(ta):
    d = np.timedelta64(int(6e10))
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])

After Cythonizing, the timings are not much different.
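Outside Jupyter, the same measurement can be sketched with the standard-library timeit module (a smaller sample than the answer's 500,000 rows is used here so the sketch runs quickly; absolute numbers will vary by machine):

```python
import timeit
import numpy as np
import pandas as pd

def td60(ta):
    d = np.timedelta64(int(6e10))  # 60 s in nanoseconds
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])

# same construction as the timing setup above, scaled down
pop_n, smp_n = 10000, 5000
np.random.seed([3, 1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()
df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))

t = timeit.timeit(lambda: pir(df), number=3)
print(f"pir(df) x3: {t:.3f}s")
```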


Reference setup for the OP's example

from io import StringIO  # Python 3; on Python 2 use `from StringIO import StringIO`
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""

df = pd.read_csv(StringIO(text), parse_dates=[0])
