Python new column based on NaN in other columns

Question

I'm quite new to Python and this is my first ever question so please be gentle with me!

I have tried out answers to other similar questions but am still quite stuck.

I am using Pandas and I have a dataframe which is a merge from multiple different SQL tables and looks something like this:

Col_1   Col_2   Col_3   Col_4
1       NaN     NaN     NaN
2       Y       NaN     NaN
3       Z       C       S
4       NaN     B       W

I don't care about the values in Col_2, Col_3 and Col_4 (note that these can be strings, integers or objects, depending on the column).

I just care that at least one of these columns is populated so ideally would like a fifth column like:

Col_1   Col_2   Col_3   Col_4   Col_5
1       NaN     NaN     NaN     0
2       Y       NaN     NaN     1
3       Z       C       S       1
4       NaN     B       W       1

Then I want to drop the columns Col_2 to Col_4.

My initial thought was something like the function below, but this is reducing my dataframe from 50000 rows to 50. I don't want to delete any rows.

def function(row):
   if (isnull.row['col_2'] and isnull.row['col_3'] and isnull.row['col_3'] is None):
      return '0'
   else:
      return '1'

df['col_5'] = df.apply(lambda row: function (row),axis=1)

Any help would be much appreciated.

Answer

Use np.any (with NumPy imported as np) and pass axis=1 so that the test runs row-wise. This produces a boolean array; converting it to int turns every True into 1 and every False into 0. It will be much faster than calling apply, which iterates row by row and is very slow:

In [30]:

df['Col_5'] = np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
df
Out[30]:
   Col_1 Col_2 Col_3 Col_4  Col_5
0      1   NaN   NaN   NaN      0
1      2     Y   NaN   NaN      1
2      3     Z     C     S      1
3      4   NaN     B     W      1

In [31]:

df = df[['Col_1', 'Col_5']]
df
Out[31]:
   Col_1  Col_5
0      1      0
1      2      1
2      3      1
3      4      1
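
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (the values are copied from the table; the names `value_cols` and `result` are only illustrative) and uses `DataFrame.notna().any(axis=1)`, which I would expect to give the same result as the `np.any` call on reasonably recent pandas versions:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks a missing value.
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": [np.nan, "Y", "Z", np.nan],
    "Col_3": [np.nan, np.nan, "C", "B"],
    "Col_4": [np.nan, np.nan, "S", "W"],
})

# Every column except the first is checked (illustrative name).
value_cols = df.columns[1:]

# True where at least one checked column is populated, then cast to 0/1.
df["Col_5"] = df[value_cols].notna().any(axis=1).astype(int)

# Drop the columns that are no longer needed; no rows are removed.
result = df.drop(columns=value_cols)
print(result)
```

Using `drop(columns=...)` instead of selecting `['Col_1', 'Col_5']` by hand is just a convenience here; either way the row count stays the same.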

This is the output from np.any:

In [34]:

np.any(df[df.columns[1:]].notnull(), axis=1)
Out[34]:
array([False,  True,  True,  True], dtype=bool)

Timings

In [35]:

%timeit df[df.columns[1:]].apply(lambda x: all(x.isnull()) , axis=1).astype(int)
%timeit np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
100 loops, best of 3: 2.46 ms per loop
1000 loops, best of 3: 1.4 ms per loop

So on your test data, for a df of this size, my method is about 1.8x faster than the other answer (1.4 ms versus 2.46 ms per loop).
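
If you want to check the gap on something larger than four rows, a rough sketch along these lines should do. Everything in it is an assumption for illustration: the synthetic frame `big`, the 5,000 row count, and the two helper functions. The `apply` variant is my own stand-in for a row-wise approach, not the other answer's exact code, and the timings will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic frame: 5,000 rows, roughly half of the cells missing.
rng = np.random.default_rng(0)
n_rows = 5_000
data = rng.random((n_rows, 3))
data[data < 0.5] = np.nan
big = pd.DataFrame(data, columns=["Col_2", "Col_3", "Col_4"])
big.insert(0, "Col_1", np.arange(n_rows))

cols = big.columns[1:]


def with_apply():
    # Row-wise Python-level loop: 1 if any value in the row is populated.
    return big[cols].apply(lambda row: row.notnull().any(), axis=1).astype(int)


def with_any():
    # Vectorised check across the whole block of columns at once.
    return big[cols].notnull().any(axis=1).astype(int)


# Both approaches should agree before their timings are compared.
assert with_apply().equals(with_any())

print("apply:", timeit.timeit(with_apply, number=1))
print("any:  ", timeit.timeit(with_any, number=10) / 10)
```

The row-wise version builds a new Series for every row, so I would expect the difference to grow with the number of rows.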

Update

As you are running pandas version 0.12.0, you need to call the top-level pd.notnull function, because that method is not available at the DataFrame level in that version:

np.any(pd.notnull(df[df.columns[1:]]), axis=1).astype(int)

I suggest you upgrade as you'll get lots more features and bug fixes.
