Replace values in a pandas series via dictionary efficiently


Question

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) is to use s.replace(d) or, occasionally, s.map(d) if all your series values are found in the dictionary keys.

However, s.replace is often unreasonably slow, frequently 5-10x slower than a simple list comprehension.

The alternative, s.map(d), has good performance, but is only recommended when every value in the series appears as a key in the dictionary.
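To make the difference between the two methods concrete, here is a minimal sketch on a toy series (the values and the names `replaced`, `mapped` and `filled` are illustrative only). It shows why s.map(d) is usually reserved for full coverage: values without a matching key come back as NaN unless they are filled in afterwards.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
d = {1: 10, 2: 20}          # 3 has no entry in the dictionary

# replace leaves values without a matching key untouched
replaced = s.replace(d)

# map turns values without a matching key into NaN (and the dtype into float)
mapped = s.map(d)

# filling the gaps from the original series restores the unmapped values
filled = s.map(d).fillna(s).astype(int)

print(replaced.tolist())
print(mapped.tolist())
print(filled.tolist())
```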

Why is s.replace so slow, and how can performance be improved?

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit [d.get(i, i) for i in lst]                  # 243ms

Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.

Recommended answer

One trivial solution is to choose a method based on an estimate of how completely the series values are covered by the dictionary keys.

General case

  • Use df['A'].map(d) if all values are mapped; or
  • Use df['A'].map(d).fillna(df['A']).astype(int) if >5% of values are mapped.

Few, e.g. <5%, values in d

  • Use df['A'].replace(d)

The "crossover point" of ~5% is specific to the benchmarking below.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
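The selection rule above could be wrapped in a small helper along the following lines. This is only a sketch: the function name `replace_via_dict`, the sampling step and the default 5% threshold are assumptions for illustration (the threshold comes from this answer's benchmark, not from pandas), and the final cast assumes the dictionary values have the same dtype as the series.

```python
import pandas as pd

def replace_via_dict(s, d, threshold=0.05, sample_size=10_000, random_state=0):
    """Hypothetical helper: pick replace or map from estimated key coverage."""
    # Estimate coverage on a sample so the check stays cheap for large series
    if len(s) > sample_size:
        sample = s.sample(sample_size, random_state=random_state)
    else:
        sample = s
    coverage = sample.isin(list(d)).mean()

    if coverage < threshold:
        # Few values are affected: replace was the faster option in the benchmark
        return s.replace(d)

    mapped = s.map(d)
    if mapped.isna().any():
        # Partial coverage: restore unmapped values and the original dtype
        # (assumes the values of d share the dtype of s)
        mapped = mapped.fillna(s).astype(s.dtype)
    return mapped

full = replace_via_dict(pd.Series([0, 1, 2]), {0: 1, 1: 2, 2: 3})
partial = replace_via_dict(pd.Series([0, 1, 2, 3]), {0: 100})
rare = replace_via_dict(pd.Series(range(100)), {0: -1})
```

The threshold is worth re-measuring on your own data and pandas version, since the crossover depends on both.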

Benchmarking

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit df['A'].map(d)                              # 84.3ms
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 - Partial Map #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
%timeit [d.get(i, i) for i in lst]                  # 243ms
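The %timeit lines above need IPython. Outside a notebook, the partial-map test could be reproduced with the standard library's timeit module, roughly as sketched below. The series is smaller here (100,000 rows, an arbitrary choice to keep the run short), the names `candidates` and `results` are illustrative, and timings will differ by machine and pandas version, so none are quoted. The sketch also checks that the three approaches agree before timing them.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, 100_000))
lst = s.tolist()
d = {i: i + 1 for i in range(10)}   # partial map: about 1% of values have a key

candidates = {
    "replace": lambda: s.replace(d),
    "map+fillna": lambda: s.map(d).fillna(s).astype(int),
    "list comp": lambda: [d.get(i, i) for i in lst],
}

# Check that all three approaches produce the same values before timing them
results = {name: fn() for name, fn in candidates.items()}
assert results["replace"].tolist() == results["list comp"]
assert results["map+fillna"].tolist() == results["list comp"]

for name, fn in candidates.items():
    # Best of three single runs, reported in milliseconds
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name:>10}: {seconds * 1e3:.1f} ms")
```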

Explanation

s.replace is so slow because it does much more than map a dictionary. It handles edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)

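A number of steps appear to be involved:

  • Converting the dictionary to a list.
  • Iterating through the list and checking for nested dictionaries.
  • Feeding an iterator of keys and values into a replace function.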
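The steps listed above can be followed in plain Python 3. The sketch below only mirrors the excerpt for illustration and is not the library's current code: d.items() stands in for compat.iteritems, the public pandas.api.types.is_dict_like stands in for the internal helper, and nested dictionaries are simply rejected.

```python
import pandas as pd
from pandas.api.types import is_dict_like

d = {0: 1, 1: 2, 2: 3}

# Step 1: convert the dictionary to a list of (key, value) pairs
items = list(d.items())
keys, values = zip(*items)

# Step 2: scan every value to detect nested dictionaries
are_mappings = [is_dict_like(v) for v in values]
if any(are_mappings):
    raise NotImplementedError("nested dictionaries are outside this sketch")

# Step 3: hand parallel sequences of keys and values to the list-based replace
to_replace, value = keys, values
result = pd.Series([0, 1, 5]).replace(list(to_replace), list(value))
print(result.tolist())
```

All of this preparation happens before any value in the series is touched, and the list-based replace then has to consider each key in turn.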

This can be compared to the much leaner code from map() in pandas\series.py:

if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
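The same index-lookup idea can be reproduced with public APIs only. The following is a rough sketch, not the pandas implementation: algos.take_1d is internal, so NumPy indexing plus a mask is used instead, and the names `lookup`, `found` and `result` are illustrative. Unlike s.map(d), this variant keeps the original value where no key matches.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 2, 5, 2])
d = {0: 10, 2: 20}

# Build a lookup series: dictionary keys become the index
lookup = pd.Series(d)

# One vectorised pass: position of each value in the index, -1 if absent
values = s.to_numpy()
indexer = lookup.index.get_indexer(values)
found = indexer != -1

# Take mapped values where a key was found, keep the original otherwise.
# Positions with -1 pick an arbitrary element, which the mask then discards.
result = np.where(found, lookup.to_numpy()[indexer], values)
print(result.tolist())
```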
