Compare multiple pandas columns (1st and 2nd, then 3rd and 4th, and so on) with vectorization (preferred) or another method
Problem description
This code compares var1 and var2 according to conditions and creates Result1 from choices (this code works well):
# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition
import pandas as pd
import numpy as np
# Creating one column from two columns. We assume that in every row there is one NaN and one value, and that value fills the new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
'var2': [1, 2, np.nan , 4, np.nan],
'var3': [np.nan , "x", np.nan, "y", "z"],
'var4': [np.nan , 4, np.nan, 5, 6],
'var5': ["a", np.nan , "b", np.nan, "c"],
'var6': [1, np.nan , 2, np.nan, 3]
})
#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
(df["var1"].notna()) & (df['var2'].notna()),
(pd.isna(df["var1"])) & (pd.isna(df["var2"])),
(df["var1"].notna()) & (pd.isna(df["var2"])),
(pd.isna(df["var1"])) & (df['var2'].notna())]
choices = ["Both values", np.nan, df["var1"], df["var2"]]
df['Result1'] = np.select(conditions, choices, default=np.nan)
df
The result should look like this:
| | var1 | var2 | var3 | var4 | var5 | var6 | Result1 |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
| 0 | a | 1 | nan | nan | a | 1 | Both values |
| 1 | b | 2 | x | 4 | nan | nan | Both values |
| 2 | c | nan | nan | nan | b | 2 | c |
| 3 | nan | 4 | y | 5 | nan | nan | 4 |
| 4 | nan | nan | z | 6 | c | 3 | nan |
Now I want to compare multiple pairs of pandas columns (in my example var1 and var2, then var3 and var4, then var5 and var6) and, based on conditions and choices, create the corresponding Result column for each pair (in my example Result1, Result2, Result3). I thought the best way would be to use vectorization (because of better performance). The df I want to get should look like this:
| | var1 | var2 | var3 | var4 | var5 | var6 | Result1 | Result2 | Result3 |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
| 0 | a | 1 | nan | nan | a | 1 | Both values | nan | Both values |
| 1 | b | 2 | x | 4 | nan | nan | Both values | Both values | nan |
| 2 | c | nan | nan | nan | b | 2 | c | nan | Both values |
| 3 | nan | 4 | y | 5 | nan | nan | 4 | Both values | nan |
| 4 | nan | nan | z | 6 | c | 3 | nan | Both values | Both values |
I tried:
import pandas as pd
import numpy as np
# Creating one column from two columns. We assume that in every row there is one NaN and one value, and that value fills the new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
'var2': [1, 2, np.nan , 4, np.nan],
'var3': [np.nan , "x", np.nan, "y", "z"],
'var4': [np.nan , 4, np.nan, 5, 6],
'var5': ["a", np.nan , "b", np.nan, "c"],
'var6': [1, np.nan , 2, np.nan, 3]
})
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
(df[col1].notna()) & (df[col2].notna()),
(pd.isna(df[col1])) & (pd.isna(df[col2])),
(df[col1].notna()) & (pd.isna(df[col2])),
(pd.isna(df[col1])) & (df[col2].notna())]
choices = ["Both values", np.nan, df[col1], df[col2]]
df[colR] = np.select(conditions, choices, default=np.nan)
But it gives me this error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Question: How can I achieve my goal with vectorization (preferable because of better performance) or another method?
Recommended answer
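The issue is that pandas aligns DataFrames on their labels (index and columns) before combining them, but df[col1] and df[col2] have no overlapping columns. The conditions therefore span all six columns, while the choices span only three, which is what np.select reports as a shape mismatch.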
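To make the alignment step visible, here is a minimal sketch on a four-column subset of the example data. It calls DataFrame.align explicitly, which the answer itself does not do; that call is only my way of showing what pandas does with mismatched column labels. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "var1": ["a", "b", "c", np.nan, np.nan],
    "var2": [1, 2, np.nan, 4, np.nan],
    "var3": [np.nan, "x", np.nan, "y", "z"],
    "var4": [np.nan, 4, np.nan, 5, 6],
})
col1 = ["var1", "var3"]
col2 = ["var2", "var4"]

# Label-based: aligning the two masks yields the union of their columns,
# so var1 is never paired with var2.
left, right = df[col1].notna().align(df[col2].notna())
aligned_shape = left.shape
aligned_columns = sorted(left.columns)

# Position-based: plain NumPy arrays are combined element by element,
# so the first column of one mask meets the first column of the other.
positional = df[col1].notna().to_numpy() & df[col2].notna().to_numpy()
positional_shape = positional.shape
```

The aligned masks cover four columns here (six in the full example), whereas the positional result keeps one column per pair, which is the shape the choices need.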
In this case, you really want to work with the underlying NumPy arrays. Also, because .isnull() is the opposite of .notnull(), you can simplify this a lot. We'll use concat to add the new columns back.
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]
df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)
var1 var2 var3 var4 var5 var6 Result1 Result2 Result3
0 a 1.0 NaN NaN a 1.0 Both values NaN Both values
1 b 2.0 x 4.0 NaN NaN Both values Both values NaN
2 c NaN NaN NaN b 2.0 c NaN Both values
3 NaN 4.0 y 5.0 NaN NaN 4 Both values NaN
4 NaN NaN z 6.0 c 3.0 NaN Both values Both values
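If this pattern is needed more than once, the same idea can be wrapped in a small function. The sketch below is one possible packaging, not part of the original answer: combine_pairs and its parameters are hypothetical names, and it casts the values to object arrays up front on the assumption that this avoids asking NumPy to find a common dtype for strings and floats.

```python
import numpy as np
import pandas as pd


def combine_pairs(df, left_cols, right_cols, result_cols, both_label="Both values"):
    """Hypothetical helper: pair left_cols[i] with right_cols[i] by position."""
    # Object arrays hold strings, numbers and NaN side by side.
    left = df[left_cols].to_numpy(dtype=object)
    right = df[right_cols].to_numpy(dtype=object)
    left_missing = df[left_cols].isna().to_numpy()
    right_missing = df[right_cols].isna().to_numpy()

    conditions = [
        ~left_missing & ~right_missing,  # both values present
        ~left_missing & right_missing,   # only the left value present
        left_missing & ~right_missing,   # only the right value present
    ]
    choices = [np.full(left.shape, both_label, dtype=object), left, right]
    # Cells where both values are missing fall through to the default.
    values = np.select(conditions, choices, default=np.nan)
    result = pd.DataFrame(values, columns=result_cols, index=df.index)
    return pd.concat([df, result], axis=1)


df = pd.DataFrame({
    "var1": ["a", "b", "c", np.nan, np.nan],
    "var2": [1, 2, np.nan, 4, np.nan],
    "var3": [np.nan, "x", np.nan, "y", "z"],
    "var4": [np.nan, 4, np.nan, 5, 6],
    "var5": ["a", np.nan, "b", np.nan, "c"],
    "var6": [1, np.nan, 2, np.nan, 3],
})
out = combine_pairs(
    df,
    ["var1", "var3", "var5"],
    ["var2", "var4", "var6"],
    ["Result1", "Result2", "Result3"],
)
```

Because the three lists are matched by position, adding another pair only means appending one name to each list.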