在 pandas 中找出并记录失败的验证条件 [英] Finding out and logging the failed validation condition in pandas
问题描述
我有一个数据框df,
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
我已经在plan_year
和name
列上进行了数据验证,如下所示,
I have put a data validation on plan_year
and name
columns like below,
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df1[['name']].notnull().all(axis=1))
我在下面得到有效的数据框,
I am getting the valid dataframe with below ,
df1 = df[m1 & m4]
我可以得到df1中不存在的行(无效的行)
I can get the rows which are not present in df1(the rows which are invalid)
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
我想跟踪由于验证而导致哪一行失败.
I want to keep track as to which row failed due to which validation.
我想获取一个包含所有无效数据数据框的数据框,如下所示-
I want to get a dataframe with all the invalid data dataframe to look something like below-
plan_year name metal_level_name Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null
有人可以帮我吗?
推荐答案
您可以使用 numpy.select
,其中带~
的boolena掩码反转:
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name \
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name \
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
对于多个错误消息,请通过DataFrame. > concat
,然后用 dot
,最后通过
For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name \
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name \
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
这篇关于在 pandas 中找出并记录失败的验证条件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!