Is there a better, more readable way to coalesce columns in pandas?
Question
I often need a new column that is the best I can achieve from other columns, and I have a specific list of preference priorities: I am willing to take the first non-null value.
import pandas as pd

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second', 'third', 'first']].apply(coalesce, axis=1)
print(df)
Result
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
This code works (and the results are what I want), but it is not very fast.
I get to pick my priorities if I need to, e.g. [['second','third','first']].
Coalesce works somewhat like the function of the same name in T-SQL.
I suspect that I may have overlooked an easy way to achieve this with good performance on large DataFrames (400,000+ rows).
I know there are lots of ways to fill in missing data, which I often use on axis=0; this is what makes me think I may have missed an easy option for axis=1.
Can you suggest something nicer/faster, or confirm that this is as good as it gets?
Answer
You could use pd.isnull to find the null values (in this case, None):
In [169]: pd.isnull(df)
Out[169]:
first second third
0 False False False
1 True False False
2 True True False
3 True True True
4 False True False
and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
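As a side note, the reason the all-null row is harmless here: np.argmin picks the index of the first minimum, so an all-True mask row yields index 0, and column 0 of such a row holds None anyway. A minimal sketch of that behaviour (the mask below is illustrative data, not taken from the answer):

```python
import numpy as np

mask = np.array([[False, False, False],
                 [True,  False, False],
                 [True,  True,  True]])   # last row: every value is null
# argmin returns the index of the first False (the first non-null column);
# for an all-True row it falls back to index 0.
print(np.argmin(mask, axis=1))  # [0 1 0]
```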
Then you could select the desired values from df using NumPy integer indexing:
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
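The row/column pairing works because NumPy integer indexing with two index arrays selects one element per (row, column) pair. A small standalone sketch (the array a below is just illustrative data):

```python
import numpy as np

a = np.array([['A', 'C', 'B'],
              ['x', 'y', 'z']], dtype=object)
rows = np.arange(len(a))   # [0, 1] -- one row index per row
cols = np.array([2, 0])    # one chosen column index per row
# Pairing the two index arrays picks a[0, 2] and a[1, 0].
print(a[rows, cols])  # ['B' 'x']
```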
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]

# combo2 uses the priority order ['second', 'third', 'first'],
# i.e. columns [1, 2, 0] of df.
order = np.array([1, 2, 0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
yields
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
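As an aside not in the original answer: since the question mentions axis=0 fill methods, a possibly more readable (if not necessarily faster) alternative is to back-fill along axis=1 and take the first column. This is a sketch, assuming a pandas version whose bfill supports axis=1 on object columns:

```python
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
# Back-fill across the columns in priority order, then keep the first
# column: the first non-null value in that order wins.
combo2 = df[['second', 'third', 'first']].bfill(axis=1).iloc[:, 0]
print(combo2.tolist())
```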
Using argmin instead of df2.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)
In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop
In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop