Resample Pandas Dataframe Without Filling in Missing Times


Question

Resampling a dataframe can take it to either a higher or a lower temporal resolution. Most of the time this is used to go to a lower resolution (e.g. resampling 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse, there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:

import numpy as np
import pandas as pd

freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")

index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")

df = pd.DataFrame(index=index, data=data, columns=cols)

# now resample to daily average
df = df.resample(rule="1D").mean()
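To make the size of the problem concrete, here is a small sketch that rebuilds the same frame and counts the rows. The names `daily`, `n_total`, and `n_with_data` are illustrative and not part of the original post. With data only on 2000-01-01 and 2020-01-01, the daily resample spans every calendar day in between.

```python
import numpy as np
import pandas as pd

# Same construction as above: 10 seconds of data in 2000 and 10 seconds in 2020.
index = np.hstack([
    pd.date_range("20000101", periods=10, freq="s").values,
    pd.date_range("20200101", periods=10, freq="s").values,
])
df = pd.DataFrame(
    np.random.randint(0, 100, (20, 10)),
    index=index,
    columns=list("ABCDEFGHIJ"),
)

daily = df.resample("1D").mean()

# Total number of daily bins versus bins that actually contain data.
n_total = len(daily)
n_with_data = int(daily.notna().any(axis=1).sum())
print(n_total, n_with_data)
```

The resampled frame has 7306 rows (one per day from 2000-01-01 through 2020-01-01 inclusive), of which only 2 hold data.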

Most of the rows in this dataframe are useless and can be removed via:

df.dropna(how="all", axis=0, inplace=True)

However, this is sloppy. Is there another way to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resulting dataframe would have only two rows)?

Answer

Here, you could use groupby instead of resample:

df = df.groupby(df.index.date).mean()

This works well for a "1D" rule because you can easily find the unique dates in the data set. In timing tests, it is also faster than resample followed by dropna:

%%timeit
df.resample(rule='1D').mean().dropna(how="all", axis=0, inplace=True)
#2.79 ms ± 77.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.groupby(df.index.date).mean()
#974 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
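As a quick sanity check (a sketch added here, not part of the original answer), the groupby result can be compared against resample followed by dropna. One difference worth knowing: grouping on `df.index.date` produces an index of plain `datetime.date` objects rather than a DatetimeIndex. Grouping on `df.index.normalize()` is one possible way to keep a DatetimeIndex. That variant and the names `via_resample`, `via_date`, and `via_normalize` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
index = np.hstack([
    pd.date_range("20000101", periods=10, freq="s").values,
    pd.date_range("20200101", periods=10, freq="s").values,
])
df = pd.DataFrame(
    rng.integers(0, 100, (20, 10)),
    index=index,
    columns=list("ABCDEFGHIJ"),
)

# Reference result: resample, then drop the all-NaN days.
via_resample = df.resample("1D").mean().dropna(how="all")

# The answer's approach: group on the calendar date of each timestamp.
via_date = df.groupby(df.index.date).mean()

# Variant (assumption): midnight-normalized timestamps keep a DatetimeIndex.
via_normalize = df.groupby(df.index.normalize()).mean()

print(np.allclose(via_resample.to_numpy(), via_date.to_numpy()))
print(type(via_date.index[0]), type(via_normalize.index[0]))
```

If later code relies on datetime-specific index operations (slicing by date strings, further resampling), the `normalize()` variant may be the more convenient of the two.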

I also thought of checking each resampled bin to see whether it is empty before joining, but this performs poorly:

%%timeit
pd.DataFrame([d.mean(axis=0).rename(i) for i,d in df.resample(rule="1D") if not d.empty])
#899 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I can imagine the groupby version getting more complicated with different frequencies, or if you wanted to average over multiple time units instead of one (say "4D" here, perhaps with 2 weeks of data on each end). But this may still be possible, either by relabeling some dates or by using multiple time attributes in the groupby call.
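One way the "relabeling" idea might look for a multi-day rule is sketched below. This is an assumption-laden illustration, not the answer's own code: it computes an integer bin number for each timestamp relative to midnight of the first day, which is how `resample` anchors fixed-width bins by default (`origin="start_day"`), and then groups on the resulting bin start. The names `step`, `origin`, `bin_id`, `labels`, and `sparse` are hypothetical. It only applies to fixed-width rules such as "4D" or "6h", not to calendar rules such as month end.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
index = np.hstack([
    pd.date_range("20000101", periods=10, freq="s").values,
    pd.date_range("20200101", periods=10, freq="s").values,
])
df = pd.DataFrame(
    rng.integers(0, 100, (20, 10)),
    index=index,
    columns=list("ABCDEFGHIJ"),
)

step = pd.Timedelta("4D")            # bin width (fixed-width rules only)
origin = df.index.min().normalize()  # midnight of the first day

# Integer bin number of each row, then the left edge of that bin as its label.
bin_id = (df.index - origin) // step
labels = origin + bin_id * step

# Only bins that contain data ever appear in the result.
sparse = df.groupby(labels).mean()

# Cross-check against the dense resample with empty bins dropped.
expected = df.resample("4D").mean().dropna(how="all")
print(list(sparse.index) == list(expected.index))
print(np.allclose(sparse.to_numpy(), expected.to_numpy()))
```

A shorter alternative would be grouping on `df.index.floor("4D")`, but that aligns bins to the Unix epoch rather than to the first day of the data, so the bin edges would generally differ from those of `resample`.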

I also tried using pd.cut, but this is also significantly slower. (I include the df construction line so that %%timeit works, since the cell overwrites df.index; the resample timing is essentially the same if the construction line is included there too.)

%%timeit
df = pd.DataFrame(index=index, data=data, columns=cols)
# bin edges at the desired frequency, plus one extra edge so that all data is included
dr = pd.date_range(df.index.min(), df.index.max() + pd.Timedelta(days=1), freq='1D')
bins = pd.cut(df.index, dr, include_lowest=True, right=False)
df.index = bins.remove_unused_categories()
output = df.groupby(df.index).mean()
#101 ms ± 832 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
