Pandas interpolation giving odd results

Problem description

I am using pandas to interpolate data points in time. However, when resampling and interpolating, I get different results for the same timestamp depending on the resampling rate.

Here is a test example:

import pandas as pd
import datetime

data = pd.DataFrame({'time': list(map(lambda a: datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S'),
                                                ['2021-03-28 12:00:00', '2021-03-28 12:01:40',
                                                 '2021-03-28 12:03:20', '2021-03-28 12:05:00',
                                                 '2021-03-28 12:06:40', '2021-03-28 12:08:20',
                                                 '2021-03-28 12:10:00', '2021-03-28 12:11:40',
                                                 '2021-03-28 12:13:20', '2021-03-28 12:15:00'])),
                     'latitude': [44.0, 44.00463175663968, 44.00919766508212,
                                  44.01357245844425, 44.0176360866699, 44.02127701531401,
                                  44.02439529286458, 44.02690530159084, 44.02873811544965,
                                  44.02984339933479],
                     'longitude': [-62.75, -62.74998054893869, -62.748902164559304,
                                   -62.74679419470262, -62.7437142666763, -62.739746727555016,
                                   -62.735000345048086, -62.72960533041183, -62.72370976436673,
                                   -62.717475524320704]})

data.set_index('time', inplace=True)

a = data.resample('20s').interpolate(method='time')
b = data.resample('60s').interpolate(method='time')

print(a.iloc[:18:3])
print(b.iloc[:6])

# --- OUTPUT --- #

                      latitude  longitude
time                                     
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988  # <-- Different Values
2021-03-28 12:02:00  44.005545 -62.749765  # <-- Different Values
2021-03-28 12:03:00  44.008284 -62.749118  # <-- Different Values
2021-03-28 12:04:00  44.010948 -62.748059  # <-- Different Values
2021-03-28 12:05:00  44.013572 -62.746794
                      latitude  longitude
time                                     
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002714 -62.749359  # <-- Different Values
2021-03-28 12:02:00  44.005429 -62.748718  # <-- Different Values
2021-03-28 12:03:00  44.008143 -62.748077  # <-- Different Values
2021-03-28 12:04:00  44.010858 -62.747435  # <-- Different Values
2021-03-28 12:05:00  44.013572 -62.746794 

The a and b dataframes should give the same values on the minute, but in most cases they differ at those timestamps.

Does anyone know what could be causing this? When plotting the full results, it looks like resampling on the minute causes pandas to ignore data at timestamps that do not fall on the minute (12:01:40 and 12:03:20, for example).

Recommended answer

A summary of my comments, with some explanation:

You can observe what is happening if you compare data.resample('60s').asfreq() to data.resample('20s').asfreq(). While all of the sample data fits onto the 20s grid, only a few values remain on the 60s grid. The question "pandas resample interpolate is producing NaNs" describes essentially the same problem.
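To make the data loss visible, one can count how many original samples survive on each grid. The following is a minimal sketch that rebuilds only the timestamps from the question (ten samples, 100 s apart); the value column and the names demo, kept_20s and kept_60s are placeholders introduced for illustration, not part of the original data.

```python
import pandas as pd

# Same timestamps as in the question: 10 samples, 100 s apart.
# The 'value' column is a placeholder, not the original data.
idx = pd.date_range("2021-03-28 12:00:00", periods=10, freq="100s")
demo = pd.DataFrame({"value": range(10)}, index=idx)

# asfreq() only keeps samples whose timestamp lies exactly on the new grid;
# every other grid point becomes NaN.
on_20s = demo.resample("20s").asfreq()
on_60s = demo.resample("60s").asfreq()

kept_20s = int(on_20s["value"].notna().sum())
kept_60s = int(on_60s["value"].notna().sum())
print("samples kept on 20s grid:", kept_20s)
print("samples kept on 60s grid:", kept_60s)
```

All ten samples are multiples of 20 s from the start, so they all land on the 20s grid. On the 60s grid only the samples at 12:00, 12:05, 12:10 and 12:15 fall exactly on a grid point; the rest are dropped before interpolation starts.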

The point is that pandas resamples first and then interpolates. If resampling leads to loss of data, those data are not available for interpolation. If you want to make use of all the data you have initially, you will want to interpolate first and then re-index to the desired grid. You can do so like this:
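The effect can be checked by hand for the 12:01:00 latitude. This is a small arithmetic sketch using three latitude samples copied from the question; the variable names are made up for illustration. It applies plain linear interpolation between the two neighbours that are available on each grid.

```python
# Latitude samples copied from the question.
lat_12_00_00 = 44.0
lat_12_01_40 = 44.00463175663968
lat_12_05_00 = 44.01357245844425

# 20s grid: the 12:01:40 sample is kept, so 12:01:00 lies
# 60 s into the 100 s interval 12:00:00 -> 12:01:40.
from_20s = lat_12_00_00 + (60 / 100) * (lat_12_01_40 - lat_12_00_00)

# 60s grid: the 12:01:40 and 12:03:20 samples are dropped, so 12:01:00 lies
# 60 s into the 300 s interval 12:00:00 -> 12:05:00.
from_60s = lat_12_00_00 + (60 / 300) * (lat_12_05_00 - lat_12_00_00)

print(from_20s, from_60s)
```

Rounded to six decimals, the two results correspond to the 44.002779 and 44.002714 shown for 12:01:00 in the question's output, which is consistent with the 60s version interpolating across a straight line from 12:00:00 to 12:05:00.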

# let's create new indices, the desired index...
new_index_20s = pd.date_range(data.index.min(), data.index.max(), freq='20s')
# and a helper for interpolation; the combination of existing and desired index
tmp_index_20s = data.index.union(new_index_20s)

new_index_60s = pd.date_range(data.index.min(), data.index.max(), freq='60s')
tmp_index_60s = data.index.union(new_index_60s)

# re-index to the helper index,
# interpolate,
# and re-index to desired index 
a1 = data.reindex(tmp_index_20s).interpolate('index').reindex(new_index_20s)
b1 = data.reindex(tmp_index_60s).interpolate('index').reindex(new_index_60s)

Now the resulting time series agree:

print(a1.iloc[:18:3])
print(b1.iloc[:6])
                      latitude  longitude
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988
2021-03-28 12:02:00  44.005545 -62.749765
2021-03-28 12:03:00  44.008284 -62.749118
2021-03-28 12:04:00  44.010948 -62.748059
2021-03-28 12:05:00  44.013572 -62.746794
                      latitude  longitude
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988
2021-03-28 12:02:00  44.005545 -62.749765
2021-03-28 12:03:00  44.008284 -62.749118
2021-03-28 12:04:00  44.010948 -62.748059
2021-03-28 12:05:00  44.013572 -62.746794
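The reindex, interpolate, reindex pattern can be wrapped in a small function if it is needed for several frequencies. The sketch below is one possible way to do that, not an established pandas API: interpolate_to_freq is a hypothetical name, the sample frame contains only the first four latitude values from the question, and method="time" is used as in the question's code. It assumes a sorted DatetimeIndex and starts the target grid at the first sample.

```python
import numpy as np
import pandas as pd


def interpolate_to_freq(df, freq):
    """Hypothetical helper: interpolate using all original samples,
    then keep only the regular grid at the given frequency."""
    # Desired regular grid, starting at the first sample.
    target = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    # Helper index: original timestamps plus the desired grid.
    helper = df.index.union(target)
    return df.reindex(helper).interpolate(method="time").reindex(target)


# First four latitude samples from the question, 100 s apart.
idx = pd.date_range("2021-03-28 12:00:00", periods=4, freq="100s")
sample = pd.DataFrame(
    {"latitude": [44.0, 44.00463175663968, 44.00919766508212, 44.01357245844425]},
    index=idx,
)

a1 = interpolate_to_freq(sample, "20s")
b1 = interpolate_to_freq(sample, "60s")

# Every 60s grid point is also a 20s grid point, so the values can be compared.
agree = np.allclose(a1.loc[b1.index, "latitude"], b1["latitude"])
print("20s and 60s results agree on the minute:", agree)
```

Because both calls interpolate on the full set of original samples before selecting the grid, the choice of frequency only decides which points are returned, not which data feed the interpolation.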
