Pandas: keeping only first row of data in each 60 second bin

Views: 159  Published: 2017/3/26 1:40:41  Tags: python pandas dataframe


Problem description


What's the best way to keep only the first row of each 60 second bin of data in pandas? i.e. For every row that occurs at increasing time t, I want to delete all rows that occur up to t+60 seconds.

I know there's some combination of groupby().first() that I can probably use, but the code examples I've seen (e.g. using pandas.Grouper(freq='60s')) will discard the original datetimes in favor of every 60 seconds offset from midnight rather than my original datetimes.

For example, the following:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
2  2016-05-11 13:02:05.760903860     0.155666
3  2016-05-11 13:02:18.325613076     0.157788
4  2016-05-11 13:02:18.486519052     0.157788
5  2016-05-11 13:02:20.243748548     0.157788
6  2016-05-11 13:02:20.533101692     0.157788
7  2016-05-11 13:02:20.646061652     0.157788
8  2016-05-11 13:02:21.121409820     0.157788
9  2016-05-11 13:04:24.660609068     0.211649
10 2016-05-11 13:04:24.660845612     0.211649
11 2016-05-11 13:04:24.660957596     0.211649
12 2016-05-11 13:04:24.661378132     0.211649
13 2016-05-11 13:04:24.661450628     0.211649
14 2016-05-11 13:04:24.661607044     0.211649

should become this:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
9  2016-05-11 13:04:24.660609068     0.211649
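
To make the concern about pandas.Grouper concrete, here is a minimal sketch on three made-up timestamps (the frame `demo` and its values are illustrative assumptions, not the data above). The grouped result is indexed by fixed bin edges aligned to the start of the day rather than by the original timestamps, and it still keeps a row that arrives only 20 seconds after the previous kept row, because the two rows happen to fall in different fixed bins.

```python
import pandas as pd

# Hypothetical timestamps: the second row is only 20 s after the first.
times = pd.to_datetime([
    "2016-05-11 13:00:50",
    "2016-05-11 13:01:10",
    "2016-05-11 13:02:30",
])
demo = pd.DataFrame({"time": times, "value": [1.0, 2.0, 3.0]})

# Fixed 60 s bins: 13:00:00, 13:01:00, 13:02:00, ...
binned = demo.groupby(pd.Grouper(key="time", freq="60s")).first()

# The index now holds bin edges, not the original timestamps,
# and all three rows survive because each sits in its own bin.
print(binned)
```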

Answer

see Path Dependent Slicing
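
"Path dependent" here presumably means that whether a row is kept depends on which earlier row was last kept, not only on its immediate neighbour. The sketch below uses hypothetical timestamps spaced 40 seconds apart (an assumption for illustration) to show why a purely vectorised test on the gap to the previous row gives a different answer from the sequential rule.

```python
import pandas as pd

base = pd.Timestamp("2016-05-11 13:00:00")
# Hypothetical rows at 0 s, 40 s, 80 s and 120 s.
times = pd.Series(base + pd.to_timedelta([0, 40, 80, 120], unit="s"))
window = pd.Timedelta(seconds=60)

# Vectorised attempt: keep a row if it is more than 60 s after the previous row.
gap_ok = times.diff() > window
gap_ok.iloc[0] = True  # always keep the first row
naive = gap_ok[gap_ok].index.tolist()

# Sequential rule: compare each row with the last row that was kept.
kept = [0]
for i in range(1, len(times)):
    if times.iloc[i] > times.iloc[kept[-1]] + window:
        kept.append(i)

print(naive)  # only the first row passes the neighbour test
print(kept)   # the row at 80 s is more than 60 s after the kept row at 0 s
```

The neighbour test never sees a gap above 60 seconds, while the sequential rule keeps the row at 80 seconds, which is why the answer walks the rows in order.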

Solution

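import numpy as np
import pandas as pd

def td60(ta):
    """Yield positions of rows to keep; ta is a sorted, non-empty datetime64 array."""
    d = np.timedelta64(60, 's')  # window length
    tp = ta + d                  # end of the window that would start at each row
    j = 0                        # position of the most recently kept row
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:           # first row strictly after the current window
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    # iloc keeps the original column names, dtypes and index labels
    return df.iloc[slc]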

Example usage

pir(df)
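
As a quick check of the approach, the sketch below runs the same generator on a six-row subset of the question's data. The subset, and the use of `df.iloc` so that the `time` and `value` column names are preserved, are assumptions made for this illustration.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Six rows taken from the question's sample (a subset, for brevity).
text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649"""

def td60(ta):
    """Yield positions of rows that start a new 60 s window (ta sorted, non-empty)."""
    tp = ta + np.timedelta64(60, "s")
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

df = pd.read_csv(StringIO(text), parse_dates=[0])
result = df.iloc[list(td60(df["time"].to_numpy()))]

print(result.index.tolist())  # positions 0, 1 and 4 of the subset
print(result)
```

The kept rows retain their original index labels, so `reset_index(drop=True)` can be applied afterwards if a fresh 0..n-1 index is wanted.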

Setup for timing 500,000 rows

import numpy as np
import pandas as pd

pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)  # sample without replacement
tidx = pd.to_datetime(tidx).sort_values()

df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))

Timing

Cythonize
In Jupyter

%load_ext Cython

%%cython
import numpy as np
import pandas as pd

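def td60(ta):
    d = np.timedelta64(60, 's')  # window length
    tp = ta + d                  # end of the window that would start at each row
    j = 0                        # position of the most recently kept row
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:           # first row strictly after the current window
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    # iloc keeps the original column names, dtypes and index labels
    return df.iloc[slc]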

After Cythonizing, the timing was not much different.
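
If compiling the loop does not help much, one alternative that might be worth timing is to skip ahead with np.searchsorted instead of visiting every row. This is only a sketch: `first_in_window` is a hypothetical helper that is not part of the original answer, it assumes the timestamps are already sorted, and no speed-up is claimed here. It performs one binary search per kept row, so it may pay off mainly when few rows are kept.

```python
import numpy as np

def first_in_window(times, window):
    """Return positions of rows that start a new window (times must be sorted)."""
    keep = []
    i, n = 0, len(times)
    while i < n:
        keep.append(int(i))
        # side="right" gives the first position strictly later than times[i] + window
        i = np.searchsorted(times, times[i] + window, side="right")
    return keep

# Hypothetical offsets in seconds from an arbitrary start time.
base = np.datetime64("2016-05-11T13:00:00", "ns")
times = base + np.array([0, 115, 115, 128, 130, 254, 254]).astype("timedelta64[s]")

keep = first_in_window(times, np.timedelta64(60, "s"))
print(keep)
```

The selected positions can then be passed to `df.iloc[keep]` in the same way as the list built from the generator.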

Reference setup for OP example

from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""

df = pd.read_csv(StringIO(text), parse_dates=[0])
