How to count the number of elements in a set of rows selected based on a condition?
Problem description
I have a large DataFrame with many duplicate values. The unique values are stored in List1. I'd like to do the following:
- For each value in the list, select the rows that contain it.
- Iterate over the selected rows and count the number of non-NaN elements in each.
- If the count is greater than or equal to 2, store the value in a new list. Each component in List1 should be added to eq_list only if all the count values for that 'eq' are >= 2.
Simplified sample input:
List1 = ['A','B','C','D','E','F','G','H','X','Y','Z']
Sample DF 'ABC':
   EQ1  EQ2  EQ3
0    A  NaN  NaN
1    X    Y  NaN
2    A    X    C
3    D    E    F
4    G    H    B
Expected output:
eq_list = ['B','C','D','E','F','G','H','X','Y']
The code I tried:
for eq in List1:
    MCS = ABC.loc[MCS_old[:] == eq]
    MCS = MCS.reset_index(drop=True)
    for index_new in range(0, len(MCS) - 1):
        if int(MCS.iloc[[index_new]].count(axis=1)) > 2:
            eq_list.append(raw_input(eq))
print(eq_list)
I hope that I have made the issue clear.
Recommended answer
The approach below identifies the set of unique values that occur in rows with at least 2 non-NaN values, then eliminates those that also occur in rows with fewer than 2 non-NaN values. It avoids looping over the rows.
First, get the set of unique values in the part of df that does not meet the missing-values restriction (adding .strip() to address a data issue mentioned in the comments):
na_threshold = 1  # rows with this many non-NaN values or fewer fail the restriction
not_enough_non_nan = df[df.count(axis=1) <= na_threshold].values.flatten().astype(str)
not_enough_non_nan = set([l.strip() for l in not_enough_non_nan if l != 'nan'])
{'A'}
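To make the row filter above concrete, here is a minimal sketch of what df.count(axis=1) returns for the sample data from the question. The DataFrame is rebuilt by hand here, and the names non_nan_per_row and too_sparse are only illustrative; they do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt by hand (assumed to match 'ABC').
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

# count(axis=1) gives the number of non-NaN cells in each row.
non_nan_per_row = df.count(axis=1)
print(non_nan_per_row.tolist())  # [1, 2, 3, 3, 3]

# Boolean indexing keeps only the rows that fail the restriction.
too_sparse = df[non_nan_per_row <= 1]
print(too_sparse.index.tolist())  # [0]
```

Only row 0 has fewer than 2 non-NaN values, which is why 'A' is the only value that ends up in not_enough_non_nan.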
Next, identify the set of values that do meet your restriction:
enough_non_nan = df[df.count(axis=1) > 1].values.flatten().astype(str)
enough_non_nan = set([l.strip() for l in enough_non_nan if l != 'nan'])
{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}
Finally, take the set difference between the two to eliminate values that do not always meet the restriction:
result = sorted(enough_non_nan - not_enough_non_nan)
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
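For convenience, the three steps can be wrapped into one function. The following is a sketch under assumptions rather than part of the original answer: the function name values_always_in_full_rows, its min_non_nan parameter, and the final intersection with List1 are all introduced here for illustration, and the sample frame is rebuilt by hand.

```python
import numpy as np
import pandas as pd


def values_always_in_full_rows(df, candidates, min_non_nan=2):
    """Return candidates that only ever appear in rows with >= min_non_nan values.

    Hypothetical helper: name and signature are illustrative only.
    """
    enough = df.count(axis=1) >= min_non_nan

    def unique_values(frame):
        # Flatten the frame and keep the stripped, non-missing cell values.
        flat = frame.to_numpy().ravel()
        return {str(v).strip() for v in flat if pd.notna(v)}

    # Values seen in "full" rows, minus any value also seen in a sparse row.
    keep = unique_values(df[enough]) - unique_values(df[~enough])
    # Restrict to the caller's list; values absent from df drop out here.
    return sorted(keep & set(candidates))


List1 = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y', 'Z']
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

result = values_always_in_full_rows(df, List1)
print(result)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
```

'A' is dropped because it also appears in row 0, which has a single non-NaN value, and 'Z' is dropped because it never appears in the frame at all.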