Adding an error log message row in a pandas DataFrame


Question

Based on the answer to "Avoiding KeyError in dataframe", I am able to run my validations. However, I need to keep track of which row fails which validation condition.

Is there a way to add a new column that holds the failure message?

My code:

            valid_dict = {
                'name': 'WI 80 INDEMNITY 18 OPTION 1 SILVER RX $10/45/90/25%',
                'issuer_id': 484,
                'service_area_id': 1,
                'plan_year': 2018,
                'network_url': np.nan,
                'formulary_url': np.nan,
                'sbc_download_url': np.nan,
                'treatment_cost_calculator_url': np.nan,
                'promotional_label': np.nan,
                'hios_plan_identifier': '99806CAAUSJ-TMP',
                'type': 'MetalPlan',
                'price_period': 'Monthly',
                'is_age_29_plan': False,
                'sort_rank_override': np.nan,
                'composite_rating': False,
            }

            data_obj = DataService()
            hios_issuer_identifer_list = data_obj.get_hios_issuer_identifer(df)

            d1 = {k: v for k, v in valid_dict.items() if k in set(valid_dict.keys()) - set(df.columns)}
            df1 = df.assign(**d1)
            cols_url = df.columns.intersection(['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url'])
            m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1))
            m2 = (df1[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1))
            m3 = (df1[cols_url].astype(str).apply(lambda x: (x.str.contains(r'\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1))
            m4 = ((df1['plan_year'].notnull()) & (df1['plan_year'].astype(str).str.isdigit()) & (df1['plan_year'].astype(str).str.len() == 4))
            m5 = ((df1['hios_plan_identifier'].notnull()) & (df1['hios_plan_identifier'].str.len() >= 10) & (df1['hios_plan_identifier'].str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z')))
            m6 = (df1['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']))
            m7 = (df1['price_period'].isin(['Monthly', 'Yearly']))
            m8 = (df1['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))
            m9 = (df1[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1))
            m10 = (df1['composite_rating'].astype(str).isin(['True', 'False']))
            m11 = (df1['hios_plan_identifier'].astype(str).str[:5].isin(hios_issuer_identifer_list))

            df1 = df1[m1 & m2 & m3 & m4 & m5 & m6 & m7 & m8 & m9 & m10 & m11].drop(d1.keys(), axis=1)

            merged =  df.merge(df1.drop_duplicates(), how='outer', indicator=True)
            merged[merged['_merge'] == 'left_only'].to_csv('logs/invalid_plan_data.csv')

            return df1

Something like the following:

 wellthie_issuer_identifier  issuer_name    ...     service_area_id     _error
0                   UHC99806  Fake Humana    ...                   1  failed on plan_year


Recommended answer

With df1 = df1[m1 & m2 & m3 & m4 & m5 & m6 & m7 & m8 & m9 & m10 & m11].drop(d1.keys(), axis=1) you are selecting only the rows where none of your conditions failed. That result cannot contain the error information you are after, which is fine: it is the validated part and should not have errors.

You can get the errors by making another selection before dropping the failed rows:

df_error = df1.copy()
df_error['error_message'] = ~m1
...
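
The same idea can be extended to every mask at once. The following is a minimal sketch, not the answerer's code: the sample data, the three simplified rules, the `masks` dictionary and the `failed_` column prefix are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df1 = pd.DataFrame({
    'name': ['Plan A', None, 'Plan C'],
    'plan_year': ['2018', '2018', '18'],
    'price_period': ['Monthly', 'Weekly', 'Yearly'],
})

# One boolean mask per validation rule (True means the row passed).
masks = {
    'name': df1['name'].notnull(),
    'plan_year': df1['plan_year'].str.isdigit() & (df1['plan_year'].str.len() == 4),
    'price_period': df1['price_period'].isin(['Monthly', 'Yearly']),
}

# Invert each mask so that True marks a failure, one column per rule.
failed = pd.DataFrame({'failed_' + rule: ~mask for rule, mask in masks.items()})

# Attach the flags and keep only the rows that failed at least one rule.
df_error = pd.concat([df1, failed], axis=1)[failed.any(axis=1)]
print(df_error)
```

Keeping the masks in a dictionary means the rule name travels with the mask, so no separate list of column names has to be maintained by hand.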

If a check failed, you can instead store error text to be displayed in the table:

# your_message_here is a placeholder for the text to store when m1 fails
df_error['failed_on_name'] = np.where(m1, '', your_message_here)
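
To get a single `_error` column like the one in the question, the per-rule results can be joined into one string per row. This is only a sketch under assumptions: the sample data, the two simplified rules and the `describe` helper are hypothetical, and the "failed on ..." wording is copied from the desired output in the question.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df1 = pd.DataFrame({
    'name': ['Plan A', None, 'Plan C'],
    'plan_year': ['2018', '2018', '18'],
})

# Rule name -> mask of the rows that pass the rule.
masks = {
    'name': df1['name'].notnull(),
    'plan_year': df1['plan_year'].str.isdigit() & (df1['plan_year'].str.len() == 4),
}

# True marks a failure; the columns are named after the rules.
failed = pd.DataFrame({rule: ~mask for rule, mask in masks.items()})

def describe(row):
    """Join the names of the failed rules, or return '' if none failed."""
    names = list(row[row].index)
    return 'failed on ' + ', '.join(names) if names else ''

df1['_error'] = failed.apply(describe, axis=1)
print(df1)
```

Rows that pass every rule get an empty string, so the valid rows can still be selected afterwards with `df1[df1['_error'] == '']`.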

If you want to write the errors to a log, you can loop over the error table and output a message for each row (this assumes the first version, with boolean values in the error columns):

for _, row in df_error.iterrows():
    print(error_message(dict(row)))

Each row can then be processed with a function like this:

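def error_message(row):
    row_desc = []   # values that describe the row
    error_msg = []  # names of the boolean columns that are True
    for k, v in row.items():
        if isinstance(v, bool):
            if v:
                error_msg.append(k)
        else:
            # convert to str so that join also works for numbers and NaN
            row_desc.append(str(v))
    return 'Row ' + ' '.join(row_desc) + ' failed with errors: ' + ' '.join(error_msg)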
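
One caveat: the isinstance(v, bool) test treats every boolean value as an error flag, so genuine boolean data columns such as is_age_29_plan or composite_rating would be reported as errors whenever they are True. A possible variant, sketched here with a hypothetical `error_message_for` helper and made-up sample data, takes the names of the flag columns explicitly:

```python
import pandas as pd

def error_message_for(row, error_cols):
    """Build a log line from one row, given the names of the flag columns."""
    failed = [col for col in error_cols if row[col]]
    desc = [str(v) for k, v in row.items() if k not in error_cols]
    return 'Row ' + ' '.join(desc) + ' failed with errors: ' + ' '.join(failed)

# Hypothetical error table: is_age_29_plan is data, the failed_* columns are flags.
df_error = pd.DataFrame({
    'name': ['Plan B'],
    'is_age_29_plan': [True],
    'failed_name': [False],
    'failed_plan_year': [True],
})
error_cols = ['failed_name', 'failed_plan_year']

messages = [error_message_for(row, error_cols) for _, row in df_error.iterrows()]
for message in messages:
    print(message)
```

Here is_age_29_plan ends up in the row description instead of the error list, because only the columns named in `error_cols` are read as flags.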
