Pandas - check if a value exists in multiple columns for each row
Question
I have the following Pandas dataframe:
Index  Name  ID1  ID2  ID3
1      A     Y    Y    Y
2      B     Y    Y
3      B     Y
4      C     Y
I wish to add a new column 'Multiple' to flag the rows where the value Y appears in more than one of the columns ID1, ID2 and ID3.
Index  Name  ID1  ID2  ID3  Multiple
1      A     Y    Y    Y    Y
2      B     Y    Y         Y
3      B     Y              N
4      C     Y              N
I would normally use np.where or np.select, e.g.:
df['multiple'] = np.where(<more than one of ID1, ID2 or ID3 contains a Y>, 'Y', 'N')
but I can't figure out how to write the conditional. The number of ID columns might grow, so I couldn't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something that counts the Y values across named columns?
Outside of Pandas I would think about working with a list: append the value of each column where it is Y, then check whether the list has a length greater than 1.
But I can't think how to do it within the limitations of np.where, np.select or df.loc.
Any pointers?
Recommended answer
Using numpy to sum the occurrences of Y by row should do it:
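# Compare every cell with 'Y', count the matches per row (axis=1), then label each row (needs `import numpy as np`)
df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', axis=1)]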
Output:
      Name ID1   ID2   ID3 multi
Index
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
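Note that comparing df.values against 'Y' looks at every column, so a Name that happened to be 'Y' would be counted too. Below is a minimal sketch of a variant that counts only the ID columns and uses np.where as the question asked. It rests on assumptions not stated in the original: the sample frame is rebuilt by hand from the question's table, the ID columns are picked by a hypothetical "ID" name prefix, and the result column is named 'Multiple'.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table; missing entries are None.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Assumption: every ID column name starts with "ID", so new ID columns
# are picked up automatically without editing the condition.
id_cols = [c for c in df.columns if c.startswith("ID")]

# Boolean frame of cells equal to 'Y', summed across columns to get a per-row count.
y_count = df[id_cols].eq("Y").sum(axis=1)

# Label rows that have a 'Y' in more than one ID column.
df["Multiple"] = np.where(y_count > 1, "Y", "N")
print(df)
```

If the ID columns do not share a prefix, id_cols can simply be written out as an explicit list of column names.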