在 pandas 中找出并记录失败的验证条件 [英] Finding out and logging the failed validation condition in pandas

查看:93
本文介绍了在 pandas 中找出并记录失败的验证条件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框df,

      plan_year                                    name metal_level_name
0        20118            Gold Heritage Plus 1500 - 02             Gold
1         2018                                     NaN         Platinum
2         2018            Gold Heritage Plus 2000 - 01             Gold

我已经在plan_yearname列上进行了数据验证,如下所示,

I have put a data validation on plan_year and name columns like below,

m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))

m1 = (df1[['name']].notnull().all(axis=1))

我在下面得到有效的数据框,

I am getting the valid dataframe with below ,

df1 = df[m1 & m4]

我可以得到df1中不存在的行(无效的行)

I can get the rows which are not present in df1(the rows which are invalid)

merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']

我想跟踪由于验证而导致哪一行失败.

I want to keep track as to which row failed due to which validation.

我想获取一个包含所有无效数据数据框的数据框,如下所示-

I want to get a dataframe with all the invalid data dataframe to look something like below-

 plan_year                                    name metal_level_name    Failed message
0        20118            Gold Heritage Plus 1500 - 02             Gold    Failed due to wrong plan_year
1         2018                                     NaN         Platinum     name column cannot be null

有人可以帮我吗?

推荐答案

您可以使用 numpy.select ,其中带~的boolena掩码反转:

You can use numpy.select with inverting boolena masks by ~:

message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'


df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold   
1       2018                           NaN         Platinum   
2       2018  Gold Heritage Plus 2000 - 01             Gold   

                  Failed message  
0  Failed due to wrong plan_year  
1     name column cannot be null  
2                             OK  


df1 = df[df['Failed message'] != 'OK']
print (df1)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold   
1       2018                           NaN         Platinum   

                  Failed message  
0  Failed due to wrong plan_year  
1     name column cannot be null  

对于多个错误消息,请通过DataFrame. > concat ,然后用 dot ,最后通过

For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:

print (df)
   plan_year                          name metal_level_name
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
2       2018  Gold Heritage Plus 2000 - 01             Gold
1      20148                           NaN         Platinum

message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'

df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
   name column cannot be null  Failed due to wrong plan_year
0                       False                           True
1                        True                          False
2                       False                          False
1                        True                           True


df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)

   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold   
1       2018                           NaN         Platinum   
2       2018  Gold Heritage Plus 2000 - 01             Gold   
1      20148                           NaN         Platinum   

                                      Failed message  
0                      Failed due to wrong plan_year  
1                         name column cannot be null  
2                                                     
1  name column cannot be null, Failed due to wron...  


df1 = df[df['Failed message'] != '']
print (df1)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold   
1       2018                           NaN         Platinum   
1      20148                           NaN         Platinum   

                                      Failed message  
0                      Failed due to wrong plan_year  
1                         name column cannot be null  
1  name column cannot be null, Failed due to wron...  

这篇关于在 pandas 中找出并记录失败的验证条件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆