大 pandas :每60秒内只保留第一行数据 [英] Pandas: keeping only first row of data in each 60 second bin
问题描述
在大熊猫中保留每60秒数据的第一行的最佳方式是什么?即对于在增加时间 t
中发生的每一行,我想删除最多发生在 t + 60
秒。
我知道我可以使用的 groupby()。first()
我看到的代码示例(例如使用 pandas.Grouper(freq ='60s')
)将丢弃原始数据时间,有利于从午夜抵消每60秒,而不是我的原始数据时间。
例如,以下内容:
价值
0✓-1113:00:10.841015028 0.215978
1✓-1113:02:05.760595780 0.155666
2✓-1113:02:05.760903860 0.155666
3pite-11 13:02:18.325613076 0.157788
4USTR-11 13:02:18.486519052 0.157788
5▲-1113:02:20.243748548 0.157788
6✓-1113:02:20.533101692 0.157788
7✓-1113:02:20.646061652 0.157788
8 5-11 13:02:21.121409820 0.157788
9✓-1113:04:24.660609068 0.211649
10✓-1113:04:24.660845612 0.211649
11 texts- 11 13:04:24.660957596 0.211649
12✓-1113:04:24.661378132 0.211649
13✓-1113:04:24.661450628 0.211649
14 2016-05-11 13 :04:24.661607044 0.211649
应该成为:
时间价值
0✓-1113:00:10.841015028 0.215978
1✓-1113:02:05.760595780 0.155666
3 2016-0511 13:04:24.660609068 0.211649
请参阅
安排时间为500,000行
pop_n,smp_n = 1000000,500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08',periods = pop_n,freq ='5s ')
tidx = np.random.choice(tidx,smp_n,False)
tidx = pd.to_datetime(tidx).sort_values()
df = pd.DataFrame dict(time = tidx,value = np.random.rand(smp_n)))
计时
Cythonize
在Jupyter
%load_ext Cython
%% cython
import numpy as np
import pandas as pd
def td60(ta):
d = np.timedelta64(int(6e10))
tp = ta + d
j = 0
产生j
为i,tx在枚举(ta)中:
如果tx> tp [j]:
yield i
j = i
def pir(df):
slc = list(td60(df.time.values))
return pd.DataFrame(df.values [slc],df.index [slc])
Cython化之后
没有太大的不同
OP示例
的参考设置<$ p $从StringIO导入String $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $
$ ,0.215978
IBUS-11 13:02:05.760595780,0.155666
IBUS-11 13:02:05.760903860,0.155666
IBUS-11 13:02:18.325613076,0.157788
IBUS-11 13:02:18.486519052,0.157788
IBUS-11 13:02:20.243748548,0.157788
IBUS-11 13:02:20.533101692,0.157788
IBUS-11 13:02:20.646061652,0.157788
ludge-11 13:02:21.121409820,0.157788
IBUS-11 13:04:24.660609068,0.211649
IBUS-11 13:04:24.660845612,0.211649
2016 -05-11 13:04:24.660957596,0.211649
IBUS-11 13:04:24.661378132,0.211649
IBUS-11 13:04:24.661450628,0.211649
-11 13:04:24.661607044,0.211649
df = pd.read_csv(StringIO(text),parse_dates = [0])
pre>
What's the best way to keep only the first row of each 60 second bin of data in pandas? i.e. For every row that occurs at increasing time t
, I want to delete all rows that occur up to t+60
seconds.
I know there's some combination of groupby().first()
that I can probably use, but the code examples I've seen (e.g. using pandas.Grouper(freq='60s')
) will discard the original datetimes in favor of every 60 seconds offset from midnight rather than my original datetimes.
For example, the following:
time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
2 2016-05-11 13:02:05.760903860 0.155666
3 2016-05-11 13:02:18.325613076 0.157788
4 2016-05-11 13:02:18.486519052 0.157788
5 2016-05-11 13:02:20.243748548 0.157788
6 2016-05-11 13:02:20.533101692 0.157788
7 2016-05-11 13:02:20.646061652 0.157788
8 2016-05-11 13:02:21.121409820 0.157788
9 2016-05-11 13:04:24.660609068 0.211649
10 2016-05-11 13:04:24.660845612 0.211649
11 2016-05-11 13:04:24.660957596 0.211649
12 2016-05-11 13:04:24.661378132 0.211649
13 2016-05-11 13:04:24.661450628 0.211649
14 2016-05-11 13:04:24.661607044 0.211649
should become this:
time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
3 2016-05-11 13:04:24.660609068 0.211649
Solution
def td60(ta):
d = np.timedelta64(int(6e10))
tp = ta + d
j = 0
yield j
for i, tx in enumerate(ta):
if tx > tp[j]:
yield i
j = i
def pir(df):
slc = list(td60(df.time.values))
return pd.DataFrame(df.values[slc], df.index[slc])
Example usage
pir(df)
Setup for timing 500,000 rows
pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()
df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))
Timing
Cythonize
In Jupyter
%load_ext Cython
%%cython
import numpy as np
import pandas as pd
def td60(ta):
d = np.timedelta64(int(6e10))
tp = ta + d
j = 0
yield j
for i, tx in enumerate(ta):
if tx > tp[j]:
yield i
j = i
def pir(df):
slc = list(td60(df.time.values))
return pd.DataFrame(df.values[slc], df.index[slc])
After Cythonizing
Not much different
reference setup for OP example
from StringIO import StringIO
import pandas as pd
text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""
df = pd.read_csv(StringIO(text), parse_dates=[0])
这篇关于大 pandas :每60秒内只保留第一行数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!