Avoiding KeyError in dataframe


Question

I am validating my dataframe with the code below:

df = df[(df[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) &
                ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4)) &
                (df[['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url']].astype(str).apply(lambda x: (x.str.contains(r'\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1)) &
                (df[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1)) &
                # (df[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1)) &
                ((df['hios_plan_identifier'].notnull()) & (df['hios_plan_identifier'].str.len() >= 10) & (df['hios_plan_identifier'].str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'))) &
                (df['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan'])) &
                (df['price_period'].isin(['Monthly', 'Yearly'])) &
                (df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))]
                # (df[['composite_rating']].astype(str).apply(lambda x: (x.str.isin(['True', 'False']) & x.isnotin(['nan'])).all(axis=1)))]

This raises

KeyError: "['name'] not in index"

when a column is not present in my dataframe. I need to handle this for every column. How can I efficiently add a check to the code above so that each validation runs only when its column is present?

Recommended answer

You can use intersection to keep only the columns that actually exist in the DataFrame:

L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)

(df[cols].notnull().all(axis=1))
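
The same idea can be stretched to per-column rules. As a minimal sketch, each check could live in a dictionary keyed by column name, and only the checks whose column exists get combined. The checks dictionary, the validate_present helper and the sample data are hypothetical names chosen for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical per-column checks: each maps a Series to a boolean Series.
checks = {
    'name': lambda s: s.notnull(),
    'plan_year': lambda s: (s.notnull()
                            & s.astype(str).str.isdigit()
                            & (s.astype(str).str.len() == 4)),
    'price_period': lambda s: s.isin(['Monthly', 'Yearly']),
}


def validate_present(df, checks):
    """Return a boolean mask built only from checks whose column exists."""
    mask = pd.Series(True, index=df.index)
    # intersection drops the keys that are not columns of df
    for col in df.columns.intersection(list(checks)):
        mask = mask & checks[col](df[col])
    return mask


# Sample data (assumed): 'price_period' is absent, so its check is skipped.
df = pd.DataFrame({'name': ['a', None, 'c'], 'plan_year': [2015, 2015, 5]})
result = df[validate_present(df, checks)]
print(result)
```

With this layout a missing column simply contributes no condition, so no KeyError is raised; whether a missing column should count as valid is a design decision this sketch assumes in the affirmative.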

Edit:

import pandas as pd

df = pd.DataFrame({
    'name': list('abcdef'),
    'plan_year': [2015, 2015, 2015, 5, 5, 4],
})
print (df)
  name  plan_year
0    a       2015
1    b       2015
2    c       2015
3    d          5
4    e          5
5    f          4

The idea is to first create a dictionary holding a valid value for each column:

valid = {'name':'a', 
        'issuer_id':'a',
        'service_area_id':'a',
        'plan_year':2015,
         ...}

Then build a second dictionary containing only the columns missing from the DataFrame, and pass it to assign, which returns a new DataFrame with those columns added:

d1 = {k: v for k, v in valid.items() if k not in df.columns}
print (d1)
{'issuer_id': 'a', 'service_area_id': 'a'}


df1 = df.assign(**d1)
print (df1)
  name  plan_year issuer_id service_area_id
0    a       2015         a               a
1    b       2015         a               a
2    c       2015         a               a
3    d          5         a               a
4    e          5         a               a
5    f          4         a               a

Then apply the filter:

m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) 
m2 = ((df1['plan_year'].notnull()) & 
      (df1['plan_year'].astype(str).str.isdigit()) & 
      (df1['plan_year'].astype(str).str.len() == 4))

df1 = df1[m1 & m2]
print (df1)
  name  plan_year issuer_id service_area_id
0    a       2015         a               a
1    b       2015         a               a
2    c       2015         a               a

Finally, you can remove the helper columns:

df1 = df1.drop(columns=list(d1))
print (df1)
  name  plan_year
0    a       2015
1    b       2015
2    c       2015
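
The steps above could be gathered into one helper, roughly as follows. This is only a sketch: filter_with_defaults and make_mask are hypothetical names, and the valid dictionary is cut down to four assumed entries for the example.

```python
import pandas as pd


def filter_with_defaults(df, valid, make_mask):
    """Add missing columns with valid values, filter, then drop them again."""
    # Entries of `valid` whose column is absent from df
    missing = {k: v for k, v in valid.items() if k not in df.columns}
    filled = df.assign(**missing)
    return filled[make_mask(filled)].drop(columns=list(missing))


def make_mask(d):
    # Same two conditions as m1 and m2 in the answer
    m1 = d[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)
    year = d['plan_year'].astype(str)
    m2 = d['plan_year'].notnull() & year.str.isdigit() & (year.str.len() == 4)
    return m1 & m2


# Assumed, shortened dictionary of valid values
valid = {'name': 'a', 'issuer_id': 'a', 'service_area_id': 'a', 'plan_year': 2015}

df = pd.DataFrame({'name': list('abcdef'),
                   'plan_year': [2015, 2015, 2015, 5, 5, 4]})
out = filter_with_defaults(df, valid, make_mask)
print(out)
```

The helper columns never reach the caller, so the original validation expression can stay unchanged as long as every value in valid passes its own check.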
