Fill missing time series data using pandas or numpy


Question

I have a list of dictionaries that looks like this:

L=[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
},
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

This is the result of a SQL query that, given a start date and an end date, returns the total prescription count for each month from the start month through the end month. However, months where the prescription count is 0 (Feb 2015 here) are skipped entirely. Is it possible to use pandas or numpy to alter this list so that it contains an entry for each missing month with 0 as the total prescriptions, as follows:

[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
},
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-2",   # 2015-2 to be inserted for missing month
"total_prescriptions": 0 # 0 to be inserted for total prescription
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

Recommended answer

What you are describing is called "resampling" in pandas. First convert the timeline column to datetimes and set it as the index:

import pandas as pd

df = pd.DataFrame(L)
df.index = pd.to_datetime(df.timeline, format='%Y-%m')
df
           timeline  total_prescriptions
timeline                                
2014-10-01  2014-10                   17
2014-11-01  2014-11                   14
2014-12-01  2014-12                    8
2015-01-01   2015-1                    4
2015-03-01   2015-3                   10
2015-04-01   2015-4                    3

Then you can add the missing months with resample('MS') (MS is the month-start frequency alias) and use fillna(0) to convert the resulting null values to zero, as you require.

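# resample() returns a Resampler; asfreq() materializes the missing months as NaN rows
df = df[['total_prescriptions']].resample('MS').asfreq().fillna(0)
df
            total_prescriptions
timeline                       
2014-10-01                 17.0
2014-11-01                 14.0
2014-12-01                  8.0
2015-01-01                  4.0
2015-02-01                  0.0
2015-03-01                 10.0
2015-04-01                  3.0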
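To make the gap-filling step more concrete, the same result can be reached by building the full monthly index explicitly and reindexing onto it. This is a minimal sketch rather than part of the original answer: the data is a shortened copy of the question's values, already indexed by month start, and the names `counts`, `full_index`, and `filled` are illustrative.

```python
import pandas as pd

# Month-start index as it looks after the conversion step (2015-02 is absent).
idx = pd.to_datetime(["2014-12-01", "2015-01-01", "2015-03-01", "2015-04-01"])
counts = pd.Series([8, 4, 10, 3], index=idx, name="total_prescriptions")

# Every month start between the first and last observed month.
full_index = pd.date_range(counts.index.min(), counts.index.max(), freq="MS")

# reindex inserts the missing months; fill_value=0 avoids an intermediate NaN,
# so the values stay integers instead of being upcast to float.
filled = counts.reindex(full_index, fill_value=0)
print(filled)
```

Reindexing is a reasonable choice when the data already has at most one row per month; resampling is the more general tool when several rows may fall into the same month.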

To convert back to a list of dictionaries, convert the datetime index back to strings with strftime (to_native_types, used in older pandas versions, is no longer available in current releases), and then export using to_dict('records'):

df['timeline'] = df.index.strftime('%Y-%m-%d')
df.to_dict('records')
[{'timeline': '2014-10-01', 'total_prescriptions': 17.0},
 {'timeline': '2014-11-01', 'total_prescriptions': 14.0},
 {'timeline': '2014-12-01', 'total_prescriptions': 8.0},
 {'timeline': '2015-01-01', 'total_prescriptions': 4.0},
 {'timeline': '2015-02-01', 'total_prescriptions': 0.0},
 {'timeline': '2015-03-01', 'total_prescriptions': 10.0},
 {'timeline': '2015-04-01', 'total_prescriptions': 3.0}]
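The output above differs from the question's target in two details: the timeline strings carry a day component and the counts are floats. The following is a hedged end-to-end sketch of one way to get back to the question's exact shape. The function name `fill_missing_months` is hypothetical, and it swaps in `sum()` for the fill step, on the assumption that adding up counts within a month is acceptable if duplicates ever occur.

```python
from datetime import datetime

import pandas as pd


def fill_missing_months(records):
    """Return one record per month, using 0 for months absent from the input."""
    # datetime.strptime accepts unpadded months such as "2015-1".
    index = pd.DatetimeIndex(
        [datetime.strptime(r["timeline"], "%Y-%m") for r in records]
    )
    counts = pd.Series([r["total_prescriptions"] for r in records], index=index)
    # The sum over an empty monthly bin is 0, so gaps become 0 and the
    # values remain integers (no NaN, no float upcast).
    monthly = counts.resample("MS").sum()
    # Rebuild the question's "YYYY-M" strings (month not zero-padded).
    return [
        {"timeline": f"{ts.year}-{ts.month}", "total_prescriptions": int(n)}
        for ts, n in monthly.items()
    ]


L = [
    {"timeline": "2014-12", "total_prescriptions": 8},
    {"timeline": "2015-1", "total_prescriptions": 4},
    {"timeline": "2015-3", "total_prescriptions": 10},
]
result = fill_missing_months(L)
print(result)
```

Only months between the first and last entries of the input are filled; months before the first or after the last record are not added.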
