Resample a pandas dataframe by an arbitrary factor

Question

Pandas resampling is really convenient if your index is a DatetimeIndex, but I haven't found an easy way to resample by an arbitrary factor. For example, treat the index as an arbitrary one and resample the dataframe so that the result is 4x shorter (while being smarter about it than just taking every 4th data point).

This would be useful for anyone working with data on a much shorter timescale than datetimes usually cover. In my case, I want to resample an audio vector from 44 kHz to 11 kHz. Right now I have to use scipy's "decimate" function and then convert the result back to a dataframe (dataframe.apply didn't work because the operation changes the length of the dataframe).

Does anyone have suggestions for how to accomplish this?

Recommended answer

You can use a DatetimeIndex to resample high-frequency data (up to nanosecond precision; caveat: I believe this is only available in the upcoming 0.13 release). I've successfully used pandas to resample electrophysiological data in the 24 kHz range. Here's an example:

In [96]: from numpy.random import randn; from pandas import Series, date_range

In [97]: index = date_range('1/1/2001 00:00:00', '1/1/2001 00:00:01', freq='22727N')

In [98]: index
Out[98]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-01-01 00:00:00, ..., 2001-01-01 00:00:00.999988]
Length: 44001, Freq: 22727N, Timezone: None

In [99]: s = Series(randn(index.size), index=index)

In [100]: s.head(10)
Out[100]:
2001-01-01 00:00:00          -0.820
2001-01-01 00:00:00.000022   -1.141
2001-01-01 00:00:00.000045    1.577
2001-01-01 00:00:00.000068   -1.031
2001-01-01 00:00:00.000090    0.343
2001-01-01 00:00:00.000113   -0.424
2001-01-01 00:00:00.000136   -0.753
2001-01-01 00:00:00.000159    0.411
2001-01-01 00:00:00.000181    0.238
2001-01-01 00:00:00.000204    1.048
Freq: 22727N, dtype: float64

In [101]: s.resample(s.index.freq * 4, how='mean')
Out[101]:
2001-01-01 00:00:00          -0.354
2001-01-01 00:00:00.000090   -0.106
2001-01-01 00:00:00.000181    0.245
2001-01-01 00:00:00.000272    0.568
2001-01-01 00:00:00.000363    0.047
2001-01-01 00:00:00.000454   -0.560
2001-01-01 00:00:00.000545   -0.485
2001-01-01 00:00:00.000636   -0.271
2001-01-01 00:00:00.000727   -0.457
2001-01-01 00:00:00.000818    0.078
2001-01-01 00:00:00.000909    0.394
2001-01-01 00:00:00.000999    0.185
2001-01-01 00:00:00.001090   -0.441
2001-01-01 00:00:00.001181    0.300
2001-01-01 00:00:00.001272   -0.521
...
2001-01-01 00:00:00.998715   -0.045
2001-01-01 00:00:00.998806   -0.044
2001-01-01 00:00:00.998897    0.090
2001-01-01 00:00:00.998988    0.748
2001-01-01 00:00:00.999078   -0.179
2001-01-01 00:00:00.999169    0.451
2001-01-01 00:00:00.999260   -1.041
2001-01-01 00:00:00.999351   -0.476
2001-01-01 00:00:00.999442   -0.234
2001-01-01 00:00:00.999533   -0.719
2001-01-01 00:00:00.999624   -0.606
2001-01-01 00:00:00.999715   -0.032
2001-01-01 00:00:00.999806   -0.296
2001-01-01 00:00:00.999897   -0.044
2001-01-01 00:00:00.999988   -0.951
Freq: 90908N, Length: 11001
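
The session above uses the pandas 0.13 API. As a rough sketch of how the same downsampling might look in a more recent pandas, assuming the `how=` keyword is no longer available and the aggregation is called as a method on the resampler instead (the seed and the `downsampled` name are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

# One second of synthetic samples spaced 22727 ns apart (roughly 44 kHz).
index = pd.date_range("2001-01-01 00:00:00", "2001-01-01 00:00:01", freq="22727ns")
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
s = pd.Series(rng.standard_normal(index.size), index=index)

# Each bin is 4 sample periods wide (90908 ns) and is reduced to its mean.
downsampled = s.resample(s.index.freq * 4).mean()

print(len(s), len(downsampled))
```

Because the bin width is built from `s.index.freq * 4`, changing the factor only means changing the multiplier.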

You can pass a callable to how, which lets you "do something more intelligent". By default, pandas takes the mean over each period (in this case, each period spans 4 × 22727 ns = 90908 ns, so that's the average over each chunk of 4 samples). In later pandas releases the how argument was replaced by methods on the resampler, e.g. s.resample(rule).mean() or s.resample(rule).apply(func).
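
To make the callable idea concrete, here is a minimal sketch. It assumes a recent pandas in which the callable is handed to the resampler's `apply` method; `peak_to_peak` is a hypothetical reduction chosen purely for illustration. The second half is not from the answer: it shows one common way to get the same factor-of-4 reduction on a plain integer index by grouping on the row position.

```python
import numpy as np
import pandas as pd


def peak_to_peak(block):
    """Hypothetical custom reduction: range of the values in one bin."""
    return block.max() - block.min()


# 16 samples spaced 22727 ns apart, holding the values 0.0 .. 15.0.
index = pd.date_range("2001-01-01", periods=16, freq="22727ns")
s = pd.Series(np.arange(16, dtype=float), index=index)

# Time-based bins of 4 sample periods, each reduced by the custom callable.
by_time = s.resample(s.index.freq * 4).apply(peak_to_peak)

# Alternative for data without a DatetimeIndex (an assumption, not part of
# the original answer): group rows by position // factor and aggregate.
df = pd.DataFrame({"left": np.arange(16.0), "right": np.arange(16.0) * 2})
factor = 4
by_position = df.groupby(np.arange(len(df)) // factor).mean()

print(by_time)
print(by_position)
```

Both approaches average (or otherwise reduce) whole blocks rather than keeping every 4th point, but neither applies an anti-aliasing filter the way scipy's "decimate" does, so for audio the filtering step may still be worth keeping.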
