Group a pandas DataFrame and validate with a condition

Problem description

DataFrame:

id   Base   field1    field2    field3
1     Y      AA         BB        CC
1     N      AA         BB        CC
1     N      AA         BB        CC     
2     Y      DD         EE        FF
2     N      OO         EE        WT
2     N      DD         JQ        FF
3     Y      MM         NN        TT
3     Y      MM         NN        TT 
3     N      MM         NN        TT

The expected result is to group this DataFrame by the id column and perform two validations.

  1. First check that each group has only one row with Base 'Y'. If so, that row is taken as the reference for step 2; otherwise write the error "More than one base Y found for ID" and continue with step 1 for the next ID.

  2. Check whether the data in the rows with Base 'N' matches the data in the row with Base 'Y', and write the names of the fields that do not match in the Error column. The product column is unique per row and can be ignored in the comparison.

Repeat this for every ID in the DataFrame.
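
Step 1 on its own can be sketched as follows. This is only an illustration, assuming id is an ordinary column of a DataFrame named df; the names y_count and multiple_y are made up for the example and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Base': ['Y', 'N', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'N'],
})

# Count the Base == 'Y' rows in each id group and copy that count onto every row of the group
y_count = df['Base'].eq('Y').astype(int).groupby(df['id']).transform('sum')

# ids that have more than one reference row
multiple_y = df.loc[y_count > 1, 'id'].unique().tolist()
print(multiple_y)  # [3]
```

Rows whose y_count is 1 belong to groups that can go on to step 2.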

The expected result is:

id  product Base  field1  field2  field3   Error
1   A        Y     AA       BB      CC     Reference value
1   B        N     AA       BB      CC     Pass
1   C        N     AA       BB      CC     Pass
2   D        Y     DD       EE      FF     Reference value
2   E        N     OO       EE      WT     field1, field3 mismatch    
2   F        N     DE       JQ      FF     field1, field2 mismatch 
3   G        Y     MM       NN      TT     more than 1 Y found for id:
3   H        Y     MM       NN      TT     more than 1 Y found for id:
3   I        N     MM       NN      TT     more than 1 Y found for id:

Any help with this?

Recommended answer

Use a custom function:

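import numpy as np
import pandas as pd

def f(x):
    # boolean mask marking the rows where Base is 'Y'
    mask = x['Base'] == 'Y'
    # more than one True means the group has several reference rows
    if mask.sum() > 1:
        error = 'more than 1 base Y found for id:{}'.format(x.name)
    else:
        # compare every column except Base and product
        cols = x.columns.difference(['Base', 'product'])
        # True where a value differs from the reference (Base 'Y') row
        mask1 = x[cols].ne(x.loc[mask, cols])
        # join the names of the mismatching columns for each row
        names = mask1.dot(mask1.columns + ', ').str.rstrip(', ')
        # a row with no mismatching column passes, even if another row in the group fails
        vals = np.where(names.eq(''), 'Pass', names + ' mismatch with base')
        error = np.where(mask, 'Base: Y', vals)

    # return a new frame instead of modifying the group in place
    return x.assign(Error=error)

# df is the sample DataFrame shown below, indexed by id;
# group_keys=False keeps the original id index on pandas 2.0 and later
df = df.groupby(level=0, group_keys=False).apply(f)
print (df)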
   product Base field1 field2 field3                              Error
id                                                                     
1        A    Y     AA     BB     CC                            Base: Y
1        B    N     AA     BB     CC                               Pass
1        C    N     AA     BB     CC                               Pass
2        D    Y     DD     EE     FF                            Base: Y
2        E    N     OO     EE     WT  field1, field3 mismatch with base
2        F    N     DD     JQ     FF          field2 mismatch with base
3        G    Y     MM     NN     TT  more than 1 base Y found for id:3
3        H    Y     MM     NN     TT  more than 1 base Y found for id:3
3        I    N     MM     NN     TT  more than 1 base Y found for id:3

Sample DataFrame:

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3], 
                   'product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'], 
                   'Base': ['Y', 'N', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'N'], 
                   'field1': ['AA', 'AA', 'AA', 'DD', 'OO', 'DD', 'MM', 'MM', 'MM'], 
                   'field2': ['BB', 'BB', 'BB', 'EE', 'EE', 'JQ', 'NN', 'NN', 'NN'], 
                   'field3': ['CC', 'CC', 'CC', 'FF', 'WT', 'FF', 'TT', 'TT', 'TT']})
df = df.set_index('id')
print (df)
   product Base field1 field2 field3
id                                  
1        A    Y     AA     BB     CC
1        B    N     AA     BB     CC
1        C    N     AA     BB     CC
2        D    Y     DD     EE     FF
2        E    N     OO     EE     WT
2        F    N     DD     JQ     FF
3        G    Y     MM     NN     TT
3        H    Y     MM     NN     TT
3        I    N     MM     NN     TT
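
The least obvious line in the function above is `mask1.dot(mask1.columns + ', ')`. Here is a small sketch of what it does, using hand-written mismatch flags for the id 2 group; the frame and variable names are illustrative only.

```python
import pandas as pd

# True marks a value that differs from the reference row (products D, E, F of id 2)
mismatch = pd.DataFrame({'field1': [False, True, False],
                         'field2': [False, False, True],
                         'field3': [False, True, False]},
                        index=['D', 'E', 'F'])

# In the dot product, True * 'name, ' keeps the string and False * 'name, ' gives '',
# so each row ends up with the names of its mismatching columns joined together.
names = mismatch.dot(mismatch.columns + ', ').str.rstrip(', ')
print(names.tolist())  # ['', 'field1, field3', 'field2']
```

An empty string means the row has no mismatch, which is what lets a row be labelled Pass.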
