Avoiding KeyError in dataframe
Question
I am validating my dataframe with the code below:
df = df[(df[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) &
((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4)) &
(df[['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url']].astype(str).apply(lambda x: (x.str.contains(r'\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1)) &
(df[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1)) &
# (df[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1)) &
((df['hios_plan_identifier'].notnull()) & (df['hios_plan_identifier'].str.len() >= 10) & (df['hios_plan_identifier'].str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'))) &
(df['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan'])) &
(df['price_period'].isin(['Monthly', 'Yearly'])) &
(df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))]
# (df[['composite_rating']].astype(str).apply(lambda x: (x.str.isin(['True', 'False']) & x.isnotin(['nan'])).all(axis=1)))]
This raises
KeyError: "['name'] not in index"
when the column is not present in my dataframe. I need to handle this for all columns. How can I efficiently add a check to the code above so that each validation runs only when its column is present?
Recommended answer
You can use intersection to keep only the columns that exist in the DataFrame:
L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)
(df[cols].notnull().all(axis=1))
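As a minimal sketch of how the intersection behaves when columns are missing, the example below uses a made-up three-row frame (the sample data and the names required, mask and filtered are assumptions for illustration, not part of the original answer). Only 'name' exists, so only 'name' is checked:

```python
import pandas as pd

# Hypothetical sample frame: 'issuer_id' and 'service_area_id' are missing.
df = pd.DataFrame({
    'name': ['a', None, 'c'],
    'plan_year': [2015, 2015, 5],
})

required = ['name', 'issuer_id', 'service_area_id']

# Keep only the required columns that actually exist in df.
cols = df.columns.intersection(required)

# True for rows where every present required column is non-null.
mask = df[cols].notnull().all(axis=1)

filtered = df[mask]
print(filtered)
```

Note that if none of the listed columns exist, df[cols] has no columns and all(axis=1) is True for every row, so the check passes everything rather than raising an error.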
EDIT:
df = pd.DataFrame({
'name':list('abcdef'),
'plan_year':[2015,2015,2015,5,5,4],
})
print (df)
name plan_year
0 a 2015
1 b 2015
2 c 2015
3 d 5
4 e 5
5 f 4
The idea is to first create a dictionary holding a valid value for each column:
valid = {'name':'a',
'issuer_id':'a',
'service_area_id':'a',
'plan_year':2015,
...}
Then filter the dictionary down to the columns that are missing, and assign them to the original DataFrame to create a new DataFrame:
d1 = {k: v for k, v in valid.items() if k not in df.columns}
print (d1)
{'issuer_id': 'a', 'service_area_id': 'a'}
df1 = df.assign(**d1)
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
3 d 5 a a
4 e 5 a a
5 f 4 a a
Then filter:
m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1))
m2 = ((df1['plan_year'].notnull()) &
(df1['plan_year'].astype(str).str.isdigit()) &
(df1['plan_year'].astype(str).str.len() == 4))
df1 = df1[m1 & m2]
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
Finally, you can remove the helper columns:
df1 = df1.drop(columns=list(d1))
print (df1)
name plan_year
0 a 2015
1 b 2015
2 c 2015
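An alternative that avoids helper columns is to keep one check per column and apply only those whose column is present. This is a rough sketch rather than part of the original answer: the checks dictionary and the validate function are hypothetical names, and only a few of the question's rules are reproduced.

```python
import pandas as pd

# Hypothetical per-column checks: each takes a Series and returns a boolean Series.
checks = {
    'name': lambda s: s.notnull(),
    'plan_year': lambda s: (s.notnull()
                            & s.astype(str).str.isdigit()
                            & (s.astype(str).str.len() == 4)),
    'price_period': lambda s: s.isin(['Monthly', 'Yearly']),
}

def validate(df, checks):
    """Return the rows that pass every check whose column exists in df."""
    mask = pd.Series(True, index=df.index)
    for col, check in checks.items():
        if col in df.columns:  # skip checks for missing columns
            mask = mask & check(df[col])
    return df[mask]

df = pd.DataFrame({
    'name': list('abcdef'),
    'plan_year': [2015, 2015, 2015, 5, 5, 4],
})

# 'price_period' is absent, so its check is skipped instead of raising KeyError.
result = validate(df, checks)
print(result)
```

With the sample frame from the answer, this keeps the same three rows (index 0 to 2) as the helper-column approach.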