Find difference between 2 columns with Nulls using pandas

Problem description

I want to find the difference between 2 columns of type int in a pandas DataFrame. I am using Python 2.7. The columns are as follows:

>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED
0                 15               NaN
1                 20               NaN
2                  7               NaN
3                  7               NaN
4                  7               NaN

Now I want to subtract QUANTITY_SHIPPED from INVOICED_QUANTITY, so I do the following:

>>> df['Diff'] = df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED']
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN   NaN
1                 20               NaN   NaN
2                  7               NaN   NaN
3                  7               NaN   NaN
4                  7               NaN   NaN

How do I take care of the NaNs? I would like to get the result below, because I want NaN to be treated as 0 (zero):

>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7

I do not want to do a df.fillna(0). For a sum I would try something like the following, which works, but there is no equivalent for a difference:

>>> df['Sum'] = df[['INVOICED_QUANTITY', 'QUANTITY_SHIPPED']].sum(axis=1)
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff  Sum
0                 15               NaN   NaN   15
1                 20               NaN   NaN   20
2                  7               NaN   NaN    7
3                  7               NaN   NaN    7
4                  7               NaN   NaN    7

Solution

You can use the sub method to perform the subtraction - this method allows NaN values to be treated as a specified value:

df['Diff'] = df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)

Which produces:

   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7
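For anyone who wants to try this out, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the question's sample data (an assumption, since the original construction code is not shown). Note that the result column is float, because NaN forces QUANTITY_SHIPPED to a float dtype.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; QUANTITY_SHIPPED is entirely missing.
df = pd.DataFrame({
    "INVOICED_QUANTITY": [15, 20, 7, 7, 7],
    "QUANTITY_SHIPPED": [np.nan] * 5,
})

# Series.sub treats a NaN on one side as fill_value before subtracting.
df["Diff"] = df["INVOICED_QUANTITY"].sub(df["QUANTITY_SHIPPED"], fill_value=0)

print(df)
```

The original QUANTITY_SHIPPED column is left untouched; only the new Diff column is added.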


The other neat way to do this is as @JianxunLi suggests: fill in the missing values in the column (creating a copy of the column) and subtract as normal.
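A rough sketch of that fill-then-subtract approach might look like the following. The DataFrame is again rebuilt by hand from the question's sample data (an assumption). Because fillna returns a new Series by default, only a temporary copy is filled and the stored column keeps its NaNs.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "INVOICED_QUANTITY": [15, 20, 7, 7, 7],
    "QUANTITY_SHIPPED": [np.nan] * 5,
})

# fillna(0) returns a filled copy; df["QUANTITY_SHIPPED"] itself is not modified.
shipped_filled = df["QUANTITY_SHIPPED"].fillna(0)
df["Diff"] = df["INVOICED_QUANTITY"] - shipped_filled

print(df)
```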

The two approaches are almost the same, although sub is a little more efficient because it doesn't need to produce a copy of the column in advance; it just fills the missing values "on the fly":

In [46]: %timeit df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED'].fillna(0)
10000 loops, best of 3: 144 µs per loop

In [47]: %timeit df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)
10000 loops, best of 3: 81.7 µs per loop
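Apart from speed, the two approaches can give different answers when values are missing on the left-hand side or on both sides. The sketch below uses two small made-up Series (hypothetical data, not from the question) to show this: sub with fill_value only substitutes when exactly one side is missing, so a row where both are NaN stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: missing on the right, on the left, and on both sides.
invoiced = pd.Series([15, np.nan, np.nan])
shipped = pd.Series([np.nan, 4, np.nan])

# fill_value applies only where exactly one operand is NaN.
via_sub = invoiced.sub(shipped, fill_value=0)             # [15.0, -4.0, nan]

# Filling only the right-hand column leaves left-hand NaNs in the result.
via_fill_right = invoiced - shipped.fillna(0)             # [15.0, nan, nan]

# Filling both columns treats every NaN as zero.
via_fill_both = invoiced.fillna(0) - shipped.fillna(0)    # [15.0, -4.0, 0.0]

print(via_sub.tolist(), via_fill_right.tolist(), via_fill_both.tolist())
```

On the question's data, where only QUANTITY_SHIPPED has missing values, all three variants agree.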
