How to count non-NaN values across columns in a pandas DataFrame?
Question
My data looks like this:
Close a b c d e Time
2015-12-03 2051.25 5 4 3 1 1 05:00:00
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00
I need to count "horizontally" (per row) the values in columns ['a'] to ['e'] that are not NaN. The outcome would be this:
df['Count'] = .....
df
Close a b c d e Time Count
2015-12-03 2051.25 5 4 3 1 1 05:00:00 5
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00 4
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00 3
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00 2
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00 1
Thanks
Recommended answer
You can subselect the columns of interest from your df and call count, passing axis=1:
In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df
Out[24]:
Close a b c d e Time count
2015-12-03 2051.25 5 4 3 1 1 05:00:00 5
2015-12-04 2088.25 5 4 3 1 NaN 06:00:00 4
2015-12-07 2081.50 5 4 3 NaN NaN 07:00:00 3
2015-12-08 2058.25 5 4 NaN NaN NaN 08:00:00 2
2015-12-09 2042.25 5 NaN NaN NaN NaN 09:00:00 1
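For anyone who wants to run this outside the IPython session, here is a minimal self-contained sketch. The frame is rebuilt by hand from the table in the question; storing Time as plain strings and using a DatetimeIndex are assumptions, since the original dtypes are not shown.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (dtypes are assumed).
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

# count(axis=1) counts the non-NaN cells in each row of the selected columns.
df["count"] = df[list("abcde")].count(axis=1)

print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```

Selecting the columns first keeps Close and Time out of the count, so they never affect the result.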
Timings
In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop
apply is the slowest, which is not a surprise. The drop version is marginally faster, but semantically I prefer just passing the list of columns of interest and calling count, for readability.
Hmm, I keep getting varying timings now:
In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop
More timings
In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop
It seems that testing with notnull and summing (notnull produces a boolean mask, so the row sum is the number of non-NaN cells) is quicker on this dataset.
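A small sketch of why the two approaches agree, using only the a to e columns from the question (rebuilt by hand here, which is an assumption about the original data): each True in the mask counts as 1 when summed.

```python
import numpy as np
import pandas as pd

# Only the columns of interest, copied from the question's table.
cols = pd.DataFrame(
    {
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
    }
)

mask = cols.notnull()         # boolean DataFrame: True where a value is present
by_sum = mask.sum(axis=1)     # True is summed as 1, False as 0
by_count = cols.count(axis=1)

# Both approaches give the same per-row counts.
print((by_sum == by_count).all())
```

Which of the two is faster may depend on the pandas version and the data, so it is worth timing both on your own frame.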
On a 50k row df the last method is slightly quicker:
In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
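To repeat this kind of comparison outside IPython, something like the sketch below could be used. The helper names make_frame and time_methods, the random data, and the 40% NaN rate are all hypothetical choices for illustration; the original 50k row frame is not shown. The apply variant is left out because it takes seconds at this size.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, seed=0):
    """Build a hypothetical test frame with columns a..e and random NaNs."""
    rng = np.random.default_rng(seed)
    values = rng.random((n_rows, 5))
    values[rng.random((n_rows, 5)) < 0.4] = np.nan  # assumed NaN rate
    return pd.DataFrame(values, columns=list("abcde"))


def time_methods(df, number=3):
    """Return the best per-call time in seconds for each candidate."""
    cols = list("abcde")
    candidates = {
        "count": lambda: df[cols].count(axis=1),
        "notnull().sum": lambda: df[cols].notnull().sum(axis=1),
    }
    return {
        name: min(timeit.repeat(fn, number=number, repeat=3)) / number
        for name, fn in candidates.items()
    }


df_big = make_frame(50_000)
timings = time_methods(df_big)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Absolute numbers will differ from the ones above depending on hardware and pandas version, so only the relative ordering on your own machine is meaningful.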