Identifying consecutive NaN's with pandas

Problem description

I am reading in a bunch of CSV files (measurement data for water levels over time) to run various analyses and visualizations on them.

Due to various reasons beyond my control, these time series often have missing data, so I do two things:

I count them in total with

Rlength=len(RainD)   #counts everything, including NaN
Rcount=RainD.count() #counts only valid numbers
NaN_Number=Rlength-Rcount

and discard the dataset if I have more missing data than a certain threshold:

Percent_Data=Rlength/100
Five_Percent=Percent_Data*5
if NaN_Number > Five_Percent:
    ...
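
The same 5% check can be sketched more compactly. This is only an illustration, not the asker's code: the small `RainD` frame below is a made-up stand-in, and `nan_fraction` and `too_sparse` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the RainD frame read from CSV
RainD = pd.DataFrame(
    {"level": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0]}
)

# isna() yields a boolean Series; its mean is the fraction of missing values
nan_fraction = RainD["level"].isna().mean()

# Discard the dataset when more than 5% of the values are missing
too_sparse = nan_fraction > 0.05

print(nan_fraction, too_sparse)
```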

If the number of NaN is sufficiently small, I would like to fill the gaps with

RainD.level=RainD.level.fillna(method='pad',limit=2)  # newer pandas deprecates method=; equivalent: RainD.level.ffill(limit=2)

And now for the issue: it's monthly data, so if I have more than 2 consecutive NaNs, I also want to discard the data, since that would mean that I "guess" a whole season, or even more.

The documentation for fillna doesn't really mention what happens when there are more consecutive NaNs than my specified limit=2, but when I look at RainD.describe() before and after ...fillna... and compare it with the base CSV, it's clear that it fills the first 2 NaNs and then leaves the rest as they are, instead of raising an error.
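
The fill behaviour described here can be checked on a toy series. This is a minimal sketch with made-up data; `s` and `filled` are hypothetical names, and `ffill(limit=2)` is used as the forward-fill equivalent of `fillna(method='pad', limit=2)`.

```python
import numpy as np
import pandas as pd

# One gap of three consecutive NaNs between two valid values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward-fill at most 2 consecutive NaNs per gap
filled = s.ffill(limit=2)

# Positions that are still NaN after filling
still_missing = filled.isna()

print(filled)
print(still_missing.tolist())
```

With a gap of three, the third NaN is left untouched. A NaN that survives the limited fill therefore hints at a gap longer than the limit, although leading NaNs (with nothing before them to propagate) also survive.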

So, long story short:

How do I identify runs of consecutive NaNs with pandas, without some complicated and time-consuming non-pandas loop?

Solution

You can use multiple boolean conditions to test if the current value and previous value are NaN:

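In [3]:

import numpy as np
import pandas as pd

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'a': [1, 3, np.nan, np.nan, 4, np.nan, 6, 7, 8]})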
df
Out[3]:
    a
0   1
1   3
2 NaN
3 NaN
4   4
5 NaN
6   6
7   7
8   8
In [6]:

# keep the rows where both the current and the previous value are NaN
df[(df.a.isnull()) & (df.a.shift().isnull())]
Out[6]:
    a
3 NaN
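
The pairwise test flags the second and any later NaN of a run. As a hedged variation (not part of the original answer), the same idea can be extended to "at least n NaNs in a row" with a rolling window over the NaN flags; `n`, `nan_flags` and `mask` are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10, np.nan, np.nan, 13, 14]}
)

n = 3  # run length to look for (assumption: "more than 2" means 3 or more)

# 1 where the value is NaN, 0 otherwise
nan_flags = df["a"].isna().astype(int)

# A window of n rows sums to n only if all n values are NaN;
# the first n-1 windows are incomplete and compare as False
mask = nan_flags.rolling(n).sum() >= n

# Index labels where a run of at least n NaNs ends or continues
flagged = df.index[mask].tolist()
print(flagged)
```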

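To find runs of more than 2 consecutive NaNs, group the NaN flags by the running count of valid values. Each run of NaNs then lands in the same group as the valid value that precedes it, so the group sum is the length of that run, and any sum greater than 2 marks a run that is too long: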

In [38]:

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10, np.nan, np.nan, 13, 14]})
df
Out[38]:
     a
0    1
1    2
2  NaN
3  NaN
4  NaN
5    6
6    7
7    8
8    9
9   10
10 NaN
11 NaN
12  13
13  14

In [41]:

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
Out[41]:
a
1    0
2    3
3    0
4    0
5    0
6    0
7    2
8    0
9    0
Name: a, dtype: int32
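
To turn the run lengths into the discard decision from the question, one possible sketch is the helper below. It is an illustration under stated assumptions rather than part of the original answer: `nan_run_lengths`, `max_run` and `discard` are hypothetical names, and it labels runs by state changes instead of by the cumulative count of valid values.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(s: pd.Series) -> pd.Series:
    """Return the length of each run of consecutive NaNs in `s`."""
    is_nan = s.isna()
    # A new run id starts whenever the NaN / not-NaN state changes
    run_id = (is_nan != is_nan.shift(fill_value=False)).cumsum()
    # Keep only the NaN positions and count how many share each run id
    return is_nan[is_nan].groupby(run_id[is_nan]).size()


s = pd.Series([1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10, np.nan, np.nan, 13, 14])

runs = nan_run_lengths(s)
# Longest gap; 0 when the series has no NaN at all
max_run = int(runs.max()) if len(runs) else 0
# Discard the dataset when any gap is longer than 2 values
discard = max_run > 2

print(runs.tolist(), max_run, discard)
```

For the example frame above this reports one run of 3 and one run of 2, so the dataset would be discarded under the "more than 2 consecutive NaNs" rule.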
