How to sum with missing values in Pandas?

Question

I want to sum Pandas Series objects, but I get weird results that don't seem to match what the documentation says.

In Pandas 0.19.2, the following code:

import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})
print(a + b)

gives me:

1    NaN
3    9.0
4    NaN
dtype: float64

However, the documentation says:

When summing data, NA (missing) values will be treated as zero

This seems to treat them as NaN rather than zeros. I was expecting the output:

1    2.0
3    9.0
4    6.0
dtype: float64

In my case the Series come from value_counts() over several columns, and I wanted to use sum(), but it gives me NaN for every row that doesn't have a value in all columns, which is wrong. There should be an integer for every row.

Another mystery for me is why the result has dtype float:

a.dtype, b.dtype, (a+b).dtype

gives:

(dtype('int64'), dtype('int64'), dtype('float64'))

which is quite surprising to me.

Edit: if I make sure that a and b have the same rows, then the resulting dtype is int64. So the change to float is clearly just to allow for the NaN value, which is a bit shocking.

Edit 2: Fix mistake in the expected output.

Solution

The claim from the documentation refers to reducing sums, i.e. reductions such as Series.sum():

>>> a + b
1    NaN
3    9.0
4    NaN
dtype: float64
>>> (a + b).sum()
9.0 # nans treated as zero...

It does not apply to element-wise (vectorized) sums. You'll have to handle those explicitly:

>>> (a + b).fillna(0)
1    0.0
3    9.0
4    0.0
dtype: float64
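
To make the distinction concrete, here is a minimal sketch that reuses a and b from the question. It contrasts the default reduction, which skips NaN, with a reduction that is told to propagate it through the skipna parameter of Series.sum(). The variable names are illustrative only.

```python
import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})

# Element-wise addition aligns on the index; labels 1 and 4 exist on
# only one side, so they become NaN.
elementwise = a + b

# Reduction: NaN is skipped by default, so only the value at label 3 counts.
total_skipping_nan = elementwise.sum()

# With skipna=False the NaN propagates into the result instead.
total_propagating_nan = elementwise.sum(skipna=False)

print(total_skipping_nan)
print(total_propagating_nan)
```

So the "treated as zero" wording describes what happens inside a reduction, not what happens when two Series are aligned and added.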

As for the promotion to float, that is a common pandas gotcha: NaN is a floating-point value, so an integer Series that acquires missing values during alignment is upcast to float64.
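
The promotion can be seen in isolation, without any addition. The following sketch reindexes a onto a label it does not have, which is roughly what alignment does before a + b is computed. The name a_aligned is made up for this example.

```python
import pandas as pd

a = pd.Series({1: 2, 3: 4})
print(a.dtype)  # integer dtype: no missing values yet

# Reindexing onto label 4, which a does not contain, introduces a NaN.
a_aligned = a.reindex([1, 3, 4])

# NaN cannot be stored in a plain integer array, so the values are
# upcast to a float dtype.
print(a_aligned.dtype)
print(a_aligned)
```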

Given your problem description, i.e. summarizing value counts across columns, you may want to add a fill_value to the addition, which the pd.Series.add method lets you do:

>>> a.add(b, fill_value=0)
1    2.0
3    9.0
4    6.0
dtype: float64

Note that, unfortunately, the result is still promoted to float because of the NaNs introduced during alignment. If that is an issue, you can easily fix it:

>>> a.add(b, fill_value=0).astype('int64')
1    2
3    9
4    6
dtype: int64
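
For the use case described in the question, summing value_counts() results from several columns, the same idea extends to any number of Series. This is a sketch under assumptions: the DataFrame and its column names are hypothetical, and functools.reduce is just one way to chain the pairwise additions.

```python
from functools import reduce

import pandas as pd

# Hypothetical data: two columns with partially overlapping categories.
df = pd.DataFrame({
    "first": ["x", "y", "x", "z"],
    "second": ["y", "y", "w", "x"],
})

# One value_counts() Series per column; their indexes differ.
counts_per_column = [df[col].value_counts() for col in df.columns]

# Add the Series pairwise, treating a label missing on one side as 0.
total = reduce(
    lambda left, right: left.add(right, fill_value=0),
    counts_per_column,
)

# Alignment promoted the values to float, so cast back to integers.
total = total.astype("int64").sort_index()

print(total)
```

Every label that appears in at least one column ends up with an integer count, which is the result the question asked for.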
