Is there a better, more readable way to coalesce columns in pandas?
Question
I often need a new column that is the best I can achieve from other columns, and I have a specific list of preference priorities: I am willing to take the first non-null value.
import pandas as pd

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second', 'third', 'first']].apply(coalesce, axis=1)
print(df)
Result
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
This code works (and the results are what I want), but it is not very fast.
I get to pick my priorities if I need to, e.g. [['second','third','first']].
Coalesce works somewhat like the function of the same name in T-SQL.
I suspect that I may have overlooked an easy way to achieve this with good performance on large DataFrames (400,000+ rows).
I know there are lots of ways to fill in missing data, which I often use on axis=0; this is what makes me think I may have missed an easy option for axis=1.
Can you suggest something nicer/faster, or confirm that this is as good as it gets?
Answer
You could use pd.isnull to find the null values (in this case, None):
In [169]: pd.isnull(df)
Out[169]:
first second third
0 False False False
1 True False False
2 True True False
3 True True True
4 False True False
and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
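As a side note, the reason the all-null row is harmless here: np.argmin picks the index of the first minimum, so an all-True mask row yields index 0, and column 0 of such a row holds None anyway. A minimal sketch of that behaviour (the mask below is illustrative data, not taken from the answer):

```python
import numpy as np

mask = np.array([[False, False, False],
                 [True,  False, False],
                 [True,  True,  True]])   # last row: every value is null
# argmin returns the index of the first False (the first non-null column);
# for an all-True row it falls back to index 0.
print(np.argmin(mask, axis=1))  # [0 1 0]
```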
Then you could select the desired values from df using NumPy integer indexing:
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
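The row/column pairing works because NumPy integer indexing with two index arrays selects one element per (row, column) pair. A small standalone sketch (the array a below is just illustrative data):

```python
import numpy as np

a = np.array([['A', 'C', 'B'],
              ['x', 'y', 'z']], dtype=object)
rows = np.arange(len(a))   # [0, 1] -- one row index per row
cols = np.array([2, 0])    # one chosen column index per row
# Pairing the two index arrays picks a[0, 2] and a[1, 0].
print(a[rows, cols])  # ['B' 'x']
```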
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]

# combo2 uses the priority order ['second', 'third', 'first'],
# i.e. columns [1, 2, 0] of df.
order = np.array([1, 2, 0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
yields
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
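As an aside not in the original answer: since the question mentions axis=0 fill methods, a possibly more readable (if not necessarily faster) alternative is to back-fill along axis=1 and take the first column. This is a sketch, assuming a pandas version whose bfill supports axis=1 on object columns:

```python
import pandas as pd

df = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': 'C'},
                   {'third': 'B', 'first': None, 'second': None},
                   {'third': None, 'first': None, 'second': None},
                   {'third': 'B', 'first': 'A', 'second': None}])
# Back-fill across the columns in priority order, then keep the first
# column: the first non-null value in that order wins.
combo2 = df[['second', 'third', 'first']].bfill(axis=1).iloc[:, 0]
print(combo2.tolist())
```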
Using argmin instead of df2.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)
In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop
In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop