Identifying consecutive NaNs with pandas
I am reading in a bunch of CSV files (measurement data for water levels over time) to do various analysis and visualizations on them.
Due to various reasons beyond my control, these time series often have missing data, so I do two things:
I count them in total with
Rlength=len(RainD) #counts everything, including NaN
Rcount=RainD.count() #counts only valid numbers
NaN_Number=Rlength-Rcount
and discard the dataset if I have more missing data than a certain threshold:
Percent_Data=Rlength/100
Five_Percent=Percent_Data*5
if NaN_Number > Five_Percent:
...
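The same threshold check can be written more compactly. This is a minimal sketch, assuming the measurements live in a DataFrame column named `level` (as `RainD.level` suggests); the frame `rain` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for RainD: ten monthly readings, one of them missing
rain = pd.DataFrame({"level": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})

# isna() yields booleans; their mean is the fraction of missing values
missing_fraction = rain["level"].isna().mean()

# Same rule as the 5% threshold above
too_sparse = missing_fraction > 0.05
print(missing_fraction, too_sparse)
```

Working on a single column keeps the comparison a scalar, which is what an `if` statement needs.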
If the number of NaN is sufficiently small, I would like to fill the gaps with
RainD.level=RainD.level.fillna(method='pad',limit=2)
And now for the issue: It's monthly data, so if I have more than 2 consecutive NaNs, I also want to discard the data, since that would mean that I "guess" a whole season, or even more.
The documentation for fillna doesn't really mention what happens when there are more consecutive NaNs than my specified limit=2, but when I look at RainD.describe() before and after ...fillna... and compare it with the base CSV, it's clear that it fills the first 2 NaNs and then leaves the rest as they are, instead of erroring out.
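The observed behaviour can be reproduced on a small series. This is a sketch with made-up values; it uses `Series.ffill`, which is equivalent to `fillna(method='pad')` (recent pandas versions deprecate the `method` argument of `fillna` in favour of `ffill`).

```python
import numpy as np
import pandas as pd

# Illustrative series with a run of four consecutive NaNs
s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])

# Forward-fill, but at most two consecutive NaNs per gap
filled = s.ffill(limit=2)

# The gap is only partially filled; no error is raised
remaining = int(filled.isna().sum())
print(filled.tolist(), remaining)
```

So a gap longer than the limit is silently left partly unfilled, which is why the run length has to be checked separately.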
So, long story short:
How do I identify a number of consecutive NaNs with pandas, without some complicated and time-consuming non-pandas loop?
You can use multiple boolean conditions to test if the current value and previous value are NaN:
In [3]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,np.nan, np.nan, 4, np.nan, 6,7,8]})
df
Out[3]:
a
0 1
1 3
2 NaN
3 NaN
4 4
5 NaN
6 6
7 7
8 8
In [6]:
df[(df.a.isnull()) & (df.a.shift().isnull())]
Out[6]:
a
3 NaN
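One edge case is worth noting: `shift()` puts a NaN in the first position, so a series that starts with a single NaN would be flagged as if it had a NaN before it. A possible workaround, sketched here on made-up data, is to shift the boolean mask instead and fill the vacated slot with `False`.

```python
import numpy as np
import pandas as pd

# Illustrative series that starts with a lone NaN
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Approach from the answer: shift the values, then test for NaN
naive = s.isnull() & s.shift().isnull()

# Variant: shift the boolean mask and treat "before the start" as not-NaN
is_nan = s.isnull()
strict = is_nan & is_nan.shift(fill_value=False)

print(list(naive[naive].index), list(strict[strict].index))
```

Whether the leading NaN should count depends on the data; for the example frame above both variants give the same result.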
If you want to find where runs of more than 2 consecutive NaNs occur, you can count the NaNs in each run as follows:
In [38]:
df = pd.DataFrame({'a':[1,2,np.nan, np.nan, np.nan, 6,7,8,9,10,np.nan,np.nan,13,14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
In [41]:
df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
Out[41]:
a
1 0
2 3
3 0
4 0
5 0
6 0
7 2
8 0
9 0
Name: a, dtype: int32
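To turn these run counts into the keep-or-discard decision from the question, the maximum of the grouped sums is enough. The following is a minimal sketch; the helper name `longest_nan_run` and the variable names are hypothetical, and the threshold of 2 comes from the question.

```python
import numpy as np
import pandas as pd

def longest_nan_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaNs."""
    # Every valid value starts a new group; NaNs join the group of the
    # last valid value before them (leading NaNs land in group 0)
    group_ids = series.notna().cumsum()
    run_lengths = series.isna().astype(int).groupby(group_ids).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

# Same values as column 'a' in the example above
level = pd.Series([1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                   np.nan, np.nan, 13, 14], dtype=float)

max_gap = longest_nan_run(level)
discard = max_gap > 2  # more than 2 consecutive missing months
print(max_gap, discard)
```

If `discard` is false, a forward fill with `limit=2` is guaranteed to close every gap.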