Insert 0-values for missing dates within MultiIndex

Question

Let's assume I have a MultiIndex that consists of a date and some categories (just one, for simplicity, in the example below). For each category I have a time series with the values of some process. I only have a value when there was an observation, and I now want to add a 0 for every date on which there was no observation. I found a way that seems very inefficient (unstacking and stacking, which creates a very large number of columns when there are millions of categories).

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# Insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)

                        value
date       category       
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

                      value
            category       
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
    datetime.date(2013, 2, 11),     datetime.date(2013, 2, 10)]

Does anybody know a smarter way to achieve the same result?

I found another way to achieve the same result, this time with two category levels:

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4, 5),
    (dt.date(2013, 2, 10), 2, 1, 7),
    (dt.date(2013, 2, 10), 2, 2, 7),
    (dt.date(2013, 2, 11), 2, 3, 7),
    (dt.date(2013, 2, 13), 1, 4, 2),
    (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
# Reindex each (category, cat2) group separately against the full date list
grouped = df.groupby(level=other_index)
df_list = []
for i, group in grouped:
    df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))

                    value
           category cat2       
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0

Recommended answer

You can build a new MultiIndex from the Cartesian product of the index levels you want, then reindex your data frame with it.

(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)

# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)

That's it! The new data frame has every possible combination of index values, and the existing data is still indexed correctly.

Read on for a more detailed explanation.

import datetime as dt
import pandas as pd

days= 4
#List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns = ['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)

Here is what the sample data looks like:

                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

Make new index

Using from_product, we can make a new MultiIndex. The new index is the Cartesian product of all the values you pass to the function: here, every date in all_dates paired with every category.

(date_index, category_index) = df.index.levels

new_index = pd.MultiIndex.from_product([all_dates, category_index])
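To make the Cartesian product concrete, here is a minimal sketch that builds a small index and prints its entries. The two-day date range and the hard-coded category list are assumptions chosen to keep the output short. Passing names= is optional and not part of the answer's code; it simply keeps the level labels on the new index.

```python
import datetime as dt
import pandas as pd

# Assumed toy inputs: two dates and two categories
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(2)]
categories = [1, 2]

# Every date is paired with every category (2 x 2 = 4 entries)
new_index = pd.MultiIndex.from_product(
    [all_dates, categories], names=["date", "category"]
)

for key in new_index:
    print(key)
```

The number of entries is the product of the lengths of the inputs, so with many dates and millions of categories the new index can become large.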

Reindex

Use the new index to reindex the existing data frame.

All the possible combinations are now present. Combinations that had no observation get a null value (NaN).

new_df = df.reindex(new_index)

The expanded, reindexed data frame now looks like this:

              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0

Nulls in integer column

You can see that the data in the new data frame has been converted from ints to floats. A default (NumPy-backed) integer column in pandas cannot hold NaN, so the column is upcast to float as soon as missing values appear. Optionally, we can convert all the nulls to 0 and cast the data back to integers.

new_df = new_df.fillna(0).astype(int)
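As an alternative to the fillna/astype round trip, reindex accepts a fill_value argument. The sketch below is not part of the original answer; it rebuilds the sample data and assumes that 0 is the right filler for every column. Because the filler is an integer, the column should keep its integer dtype.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame(
    [
        (dt.date(2013, 2, 10), 1, 4),
        (dt.date(2013, 2, 10), 2, 7),
        (dt.date(2013, 2, 11), 2, 7),
        (dt.date(2013, 2, 13), 1, 2),
        (dt.date(2013, 2, 13), 2, 3),
    ],
    columns=["date", "category", "value"],
).set_index(["date", "category"])

new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names
)

# fill_value is written into every row that did not exist before reindexing
filled = df.reindex(new_index, fill_value=0)
print(filled)
print(filled["value"].dtype)
```

This does the filling in a single step and never creates an intermediate float column.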

Result

              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
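The question's second example has two category levels, and a full Cartesian product of all levels would also create (category, cat2) pairs that never occur in the data. One possible variation, sketched below, crosses the dates only with the combinations that were actually observed. The helper name fill_missing_dates and its parameters are hypothetical, and the sketch assumes 0 is an acceptable filler for every column.

```python
import datetime as dt
import pandas as pd


def fill_missing_dates(df, all_dates, date_level="date"):
    """Hypothetical helper: give every observed combination of the
    non-date index levels one row per date, filling new rows with 0."""
    other_levels = [n for n in df.index.names if n != date_level]
    # Unique combinations of the non-date levels that occur in the data
    combos = df.index.droplevel(date_level).unique()
    rows = []
    for combo in combos:
        # A single remaining level yields scalars rather than tuples
        combo = combo if isinstance(combo, tuple) else (combo,)
        for d in all_dates:
            rows.append((d,) + combo)
    names = [date_level] + other_levels
    new_index = pd.MultiIndex.from_tuples(rows, names=names)
    return df.reorder_levels(names).reindex(new_index, fill_value=0)


all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame(
    [
        (dt.date(2013, 2, 10), 1, 4, 5),
        (dt.date(2013, 2, 10), 2, 1, 7),
        (dt.date(2013, 2, 10), 2, 2, 7),
        (dt.date(2013, 2, 11), 2, 3, 7),
        (dt.date(2013, 2, 13), 1, 4, 2),
        (dt.date(2013, 2, 13), 2, 4, 3),
    ],
    columns=["date", "category", "cat2", "value"],
).set_index(["date", "category", "cat2"])

result = fill_missing_dates(df, all_dates)
print(result)
```

Unlike the groupby loop in the question, the category levels stay in the index, so filled rows keep their real category values instead of being overwritten with 0.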
