Search for elements by timestamp in a sorted pandas dataframe


Question

I have a very large pandas DataFrame/Series with millions of elements, and I need to find all the elements whose timestamp is less than t0. Normally I would do this:

selected_df = df[df.index < t0]

This takes ages. As I understand it, pandas checks every element of the dataframe when it searches. However, I know that my dataframe is sorted, so the loop could stop as soon as a timestamp is no longer less than t0. I assume pandas doesn't know that the dataframe is sorted and searches through all the timestamps.

I tried using pandas.Series instead, but it is still very slow. I also tried writing my own loop:

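boundary = 0
ticks_time_list = df.index
# Advance to the first timestamp that is not earlier than t0,
# stopping at the end of the index if every timestamp is earlier.
while boundary < len(ticks_time_list) and ticks_time_list[boundary] < t0:
    boundary += 1
# Rows at positions 0..boundary-1 have a timestamp < t0.
selected_df = df.iloc[:boundary]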

This takes even longer than the pandas search. The only solution I can see at the moment is to use Cython. Any ideas on how this can be solved without involving C?

Recommended answer

It doesn't really seem to take ages for me, even with a long frame:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop

But if we're really trying to squeeze out every drop of performance, we can use the searchsorted method after dropping down to numpy:

>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True

This is roughly 350 times faster in the timings above (51.9 µs versus 18.5 ms per loop).
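For reuse, the same idea can be wrapped in a small function. The following is only a sketch under a few assumptions: `rows_before` is a hypothetical helper name (not part of pandas), the index is assumed to be sorted in ascending order, and it calls `searchsorted` on the pandas index directly instead of dropping down to the numpy array as the answer does. With `side="left"`, the returned position is that of the first entry not earlier than t0, so the slice keeps exactly the rows with a timestamp strictly less than t0.

```python
import pandas as pd


def rows_before(df: pd.DataFrame, t0) -> pd.DataFrame:
    """Return the rows whose index value is strictly less than t0.

    Hypothetical helper for illustration; assumes an ascending DatetimeIndex.
    """
    if not df.index.is_monotonic_increasing:
        # Binary search only gives a meaningful answer on sorted data.
        raise ValueError("index must be sorted in ascending order")
    # side="left" returns the position of the first entry >= t0,
    # so everything before that position is < t0.
    pos = df.index.searchsorted(pd.Timestamp(t0), side="left")
    return df.iloc[:pos]


# Small frame in the same shape as the one used in the answer.
df = pd.DataFrame(
    {"A": 2, "B": 3},
    index=pd.date_range("2001-01-01", freq="1min", periods=1000),
)
t0 = "2001-01-01 05:00"

fast = rows_before(df, t0)   # binary search on the sorted index
slow = df[df.index < t0]     # boolean mask over every timestamp
print(fast.equals(slow))
```

The monotonicity check is there because `searchsorted` does not verify ordering itself; on an unsorted index it would silently return a position that does not correspond to the boolean-mask result.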
