Pandas .loc without KeyError

Problem description

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']]  # Succeeds in older pandas (the missing label becomes NaN); raises KeyError since pandas 1.0

I'd like something that doesn't fail in either of these cases:

>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError

Is there a function like loc which gracefully handles this, or some other way of expressing this query?

Solution

Update in response to @AlexLenail's comment
It's a fair point that the list comprehension in the original answer below will be slow for large lists. I did a little more digging and found that the intersection method is available on any Index, which covers both df.index and df.columns. I'm not sure about its algorithmic complexity, but it is much faster empirically.

You can do something like this.

good_keys = df.index.intersection(all_keys)
df.loc[good_keys]

Or like your example

df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
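If you need this in several places, the intersection approach can be wrapped in a small helper. The following is only a sketch: `safe_loc` is a hypothetical name, not a pandas function, and it assumes you are happy to get back a (possibly empty) DataFrame even when a single label is passed.

```python
import pandas as pd


def safe_loc(df, keys):
    """Hypothetical helper: select the rows of df whose labels appear in keys.

    Labels that are not in df.index are silently ignored instead of
    raising KeyError.
    """
    if not pd.api.types.is_list_like(keys):
        keys = [keys]  # treat a single label as a one-element list
    return df.loc[df.index.intersection(list(keys))]


df = pd.DataFrame([1], index=['1'])

missing_scalar = safe_loc(df, '2')      # empty DataFrame, no KeyError
missing_list = safe_loc(df, ['2'])      # empty DataFrame, no KeyError
partial = safe_loc(df, ['1', '2'])      # only the row labelled '1'
```

Unlike plain .loc, this sketch always returns a DataFrame, so a scalar label does not give back a Series.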

Here is a small experiment comparing the two approaches:

import numpy as np
import pandas as pd

n = 100000

# Create random values with string index labels "0" .. str(n - 1).
# bad_idx holds twice as many labels, so half of them are not in the DataFrame index.
rand_val = np.random.rand(n)
rand_idx = [str(x) for x in range(n)]
bad_idx = [str(x) for x in range(n * 2)]

df = pd.DataFrame(rand_val, index=rand_idx)
df.head()

def get_valid_keys_list_comp():
    # Return filtered DataFrame using list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]

%%timeit 
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
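A likely reason for the gap is that `key in df.index.values` tests membership against a NumPy array, which scans the whole array for every key. As a sketch of two alternatives (not part of the original answer, and not timed here), you can test membership against the Index itself, which uses a hash-based lookup, or build a boolean mask with `Index.isin`. The function names below are made up for illustration, and `n` is kept small so the sketch runs quickly.

```python
import numpy as np
import pandas as pd

n = 1000  # smaller than in the experiment above, purely to keep this quick
df = pd.DataFrame(np.random.rand(n), index=[str(x) for x in range(n)])
bad_idx = [str(x) for x in range(n * 2)]


def get_valid_keys_index_lookup():
    # `key in df.index` uses the Index's hash-based lookup instead of
    # scanning the array returned by df.index.values for every key.
    vkeys = [key for key in bad_idx if key in df.index]
    return df.loc[vkeys]


def get_valid_keys_isin():
    # Boolean mask over the rows: True where the row label is in bad_idx.
    return df[df.index.isin(bad_idx)]


by_lookup = get_valid_keys_index_lookup()
by_isin = get_valid_keys_isin()
```

The list comprehension keeps the order (and any repeats) of the requested keys, while the `isin` mask keeps the DataFrame's own row order.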

Original answer

I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to keep only the valid index labels, with something like this.

Given a DataFrame df2:

         A           B    C  D    F
test   1.0  2013-01-02  1.0  3  foo
train  1.0  2013-01-02  1.0  3  foo
test   1.0  2013-01-02  1.0  3  foo
train  1.0  2013-01-02  1.0  3  foo

You can filter your index query like this:

keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]

This will also work for columns if you use df2.columns instead of df2.index.values.
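As a rough illustration of the column case, the sketch below rebuilds a frame shaped like df2 (the values are placeholders copied from the table above) and filters a list of requested columns. The name `wanted_cols` and the missing column "Z" are invented for the example.

```python
import pandas as pd

# Placeholder frame shaped like df2 above.
df2 = pd.DataFrame(
    {
        "A": [1.0] * 4,
        "B": [pd.Timestamp("2013-01-02")] * 4,
        "C": [1.0] * 4,
        "D": [3] * 4,
        "F": ["foo"] * 4,
    },
    index=["test", "train", "test", "train"],
)

wanted_cols = ["A", "D", "Z"]  # "Z" is not a column of df2

# List-comprehension version, checking membership against df2.columns.
valid_cols = [col for col in wanted_cols if col in df2.columns]
subset = df2.loc[:, valid_cols]

# Same idea with Index.intersection on the columns.
subset_via_intersection = df2.loc[:, df2.columns.intersection(wanted_cols)]
```

Both selections keep all four rows and only the columns that actually exist.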
