How to count the number of elements in a set of rows selected based on a condition?

Problem description

I have a large DataFrame with many duplicate values. The unique values are stored in List1. I'd like to do the following:

  1. Select the rows that contain each of the values present in the list.
  2. Iterate over the selected rows and count the number of non-NaN elements in each row.
  3. If the count is greater than or equal to 2, store the value in a new list. Each component of List1 should be added to eq_list only if all of the count values for that 'eq' are >= 2.

Simplified sample input:

List1 = ['A','B','C','D','E','F','G','H','X','Y','Z']

Sample DF 'ABC':

        EQ1  EQ2   EQ3
0       A    NaN   NaN
1       X    Y     NaN
2       A    X     C
3       D    E     F
4       G    H     B

Expected output:

eq_list = ['B','C','D','E','F','G','H','X','Y']

A small snippet of what I tried:

for eq in List1:
    MCS=ABC.loc[MCS_old[:] ==eq]
    MCS = MCS.reset_index(drop=True)
    for index_new in range(0,len(MCS)-1):
        if int(MCS.iloc[[index_new]].count(axis=1))>2:
            eq_list.append(raw_input(eq))
            print(eq_list)

I hope that I have made the issue clear.

Recommended answer

The approach below identifies the set of (unique) values that occur in rows with at least 2 non-NaN values, then eliminates those that also occur in rows with fewer than 2 non-NaN values. It avoids explicit row-by-row loops.

First, get the set of unique values in the part of df that does not meet the missing-values restriction (adding .strip() to address a data issue mentioned in the comments):

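na_threshold = 1  # rows with this many non-NaN values or fewer fail the restriction
not_enough_non_nan = df[df.count(axis=1) <= na_threshold].values.flatten().astype(str)
# After astype(str), missing cells show up as the string 'nan'; drop them.
not_enough_non_nan = {l.strip() for l in not_enough_non_nan if l != 'nan'}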

{'A'}
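
To see which rows the mask picks out, here is a minimal sketch that rebuilds the sample frame from the question and prints the per-row count of non-NaN values. The construction of `df` is an assumption, since the original data was posted only as a printed table.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt by hand (assumed construction).
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

# count(axis=1) gives the number of non-NaN cells in each row.
row_counts = df.count(axis=1)
print(row_counts.tolist())  # [1, 2, 3, 3, 3]

# Rows with at most one non-NaN value fail the restriction.
failing_rows = df[row_counts <= 1]
print(failing_rows.index.tolist())  # [0]
```

Only row 0 fails, and its single value is 'A', which is why the first set is {'A'}.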

Next, identify the set of values that do meet your restriction:

enough_non_nan = df[df.count(axis=1) > na_threshold].values.flatten().astype(str)
# Same cleanup as before: drop the 'nan' strings and strip stray whitespace.
enough_non_nan = {l.strip() for l in enough_non_nan if l != 'nan'}

{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}

Finally, take the set difference between the two to eliminate values that do not always meet the restriction:

result = sorted(enough_non_nan - not_enough_non_nan)

['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
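
Putting the three steps together, the following is a self-contained sketch rather than the answerer's exact code. The helper name `values_always_in_full_rows` and its `min_non_nan` parameter are hypothetical, the sample frame is rebuilt by hand from the question, and the final filter against List1 is an added assumption about how eq_list should be restricted.

```python
import numpy as np
import pandas as pd


def values_always_in_full_rows(df, min_non_nan=2):
    """Return the sorted values that occur only in rows with at least
    min_non_nan non-NaN cells (hypothetical helper, set-difference idea)."""
    enough = df.count(axis=1) >= min_non_nan

    def unique_values(frame):
        # Flatten all cells, skip missing ones, strip stray whitespace.
        cells = frame.to_numpy().ravel()
        return {str(c).strip() for c in cells if pd.notna(c)}

    # Values seen in "good" rows, minus values that also appear in "bad" rows.
    return sorted(unique_values(df[enough]) - unique_values(df[~enough]))


# Sample data from the question (assumed construction).
List1 = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y', 'Z']
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

result = values_always_in_full_rows(df)
# Keep only values that are also listed in List1.
eq_list = [v for v in result if v in List1]
print(eq_list)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
```

'A' is dropped because it appears in row 0, which has a single non-NaN value, and 'Z' never appears in the frame at all.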
