Using pandas to fill gaps only, and not NaNs on the ends


Question

I have some housing price data that spans about 8 months and tracks each house's price from the time it comes onto the market until it is sold. There are a couple of gaps in the middle of the data that I'd like to fill in, but I'd like to leave the NaNs on either end of each column untouched.

To use a simple example, let's say house1 comes on the market for 200000 on 'Day 4' and sells for 190000 on 'Day 9', while house2 stays at 180000 for Days 1 - 12 and doesn't sell in that time window. But something went wrong on days 6 and 7 and I lost the data:

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

Now imagine that, instead of regular arrays, these were columns in a pandas DataFrame indexed by date.
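As a rough sketch of that setup, the two lists could be placed in a DataFrame with a daily DatetimeIndex. The start date and the name `houses` are assumptions made for illustration; the question only refers to "Day 1" through "Day 12".

```python
import numpy as np
import pandas as pd

NaN = np.nan
house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

# Hypothetical start date: the question does not give real calendar dates.
dates = pd.date_range("2013-01-01", periods=12, freq="D")

# One column per house, one row per day.
houses = pd.DataFrame({"house1": house1, "house2": house2}, index=dates)
print(houses)
```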

The trouble is, the function I would normally use to fill the gaps here is DataFrame.fillna() with either the backfill or ffill method. If I use ffill, house1 returns this:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

That fills the gap, but it also incorrectly fills the data past the day of sale. If I use backfill instead, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front end of the data. If I use 'limit=2' with ffill, then what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once again, it fills the gap, but then it also begins to fill the data beyond the point where the 'real' data ends.
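The three attempts described above can be reproduced with a short sketch. It uses the `ffill()` and `bfill()` shortcut methods rather than `fillna(method=...)`, which is an assumption about how the calls would be written on a recent pandas version; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

NaN = np.nan
house1 = pd.Series([NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN])

# Forward fill: closes the gap, but also fills the three NaNs after the sale.
forward = house1.ffill()

# Backward fill: closes the gap, but also fills the three NaNs before the listing.
backward = house1.bfill()

# Forward fill limited to 2 consecutive NaNs: the gap is closed,
# and the first two of the three trailing NaNs are filled as well.
limited = house1.ffill(limit=2)

for name, result in [("ffill", forward), ("bfill", backward), ("ffill, limit=2", limited)]:
    print(name, result.tolist())
```

None of the three leaves both ends untouched, which is the behavior the question is after.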

My solution so far was to write the following function:

import pandas as pd


def isANumber(value):
    # Helper not shown in the original post; assumed here to mean "not NaN".
    return pd.notna(value)


def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        # Work on a copy so the caller's column is not modified in place
        filled_col = column.copy()
        lastValue = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        for i, currentValue in enumerate(column):
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap
            elif isANumber(currentValue):
                # Positional assignment, so this also works with a date index
                filled_col.iloc[lastIndex + 1:i] = lastValue
                # Remember the number that closed the gap, in case another gap follows immediately
                lastIndex = i
                lastValue = currentValue
                gap = False
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

It simply skips all the NaNs in front, fills in the gaps (defined as groups of NaNs between real values), and doesn't fill in the NaNs on the end.

Is there a cleaner way to do this, or a built-in pandas function I'm unaware of?

Recommended answer

You can use fillna on just part of the Series. Based on your description, fillna should only fill the NaNs after the first non-NaN and before the last non-NaN:

import numpy as np
import pandas as pd


def fill_column(house):
    house = house.copy()
    # Keep only the real values to find the first and last valid labels
    non_nans = house[house.notna()]
    start, end = non_nans.index[0], non_nans.index[-1]
    # .loc slices by label and includes both endpoints (.ix was removed in pandas 1.0)
    house.loc[start:end] = house.loc[start:end].ffill()
    return house


house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print(fill_column(house1))

Output:

0          NaN
1          NaN
2          NaN
3     200000.0
4     200000.0
5     200000.0
6     200000.0
7     200000.0
8     190000.0
9          NaN
10         NaN
11         NaN
dtype: float64

Note that this assumes the Series contains at least two non-NaNs, corresponding to the prices on the first and last day. If a column has no non-NaN values at all, non_nans.index[0] raises an IndexError.
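For a whole DataFrame, one possible alternative is to forward-fill everything and then mask out the positions that have no later real value. This is a sketch of a different technique from the answer's slice-based approach, not a built-in pandas function; the name `fill_inner_gaps` is hypothetical.

```python
import numpy as np
import pandas as pd


def fill_inner_gaps(frame):
    """Forward-fill NaNs that have a real value both before and after them."""
    forward = frame.ffill()
    # After a backward fill, the only NaNs left in a column are the ones
    # that come after its last real value.
    has_later_value = frame.bfill().notna()
    # Keep forward-filled values only where a later real value exists.
    # Leading NaNs stay NaN because the forward fill never touched them.
    return forward.where(has_later_value)


NaN = np.nan
houses = pd.DataFrame({
    "house1": [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN],
    "house2": [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000],
})

filled = fill_inner_gaps(houses)
print(filled)
```

Because it never looks up the first or last valid label, this version leaves an all-NaN column unchanged instead of raising an error.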
