Replace values in a pandas series via dictionary efficiently

Question

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) is to either use s.replace(d) or, occasionally, use s.map(d) if all your series values are found in the dictionary keys.

However, performance using s.replace is often unreasonably slow, often 5-10x slower than a simple list comprehension.

The alternative, s.map(d), has good performance, but is only recommended when all series values are found among the dictionary keys, because any value without a matching key becomes NaN.
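
To make the difference concrete, here is a minimal sketch on a toy series (the data and the names `mapped` and `replaced` are illustrative, not from the question). It shows how map turns unmapped values into NaN, while replace leaves them untouched:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
d = {1: 10}  # only one of the three values has a key

# map: values without a key in d become NaN, and the dtype becomes float
mapped = s.map(d)

# replace: values without a key in d are kept as they are
replaced = s.replace(d)

print(mapped.isna().sum())
print(replaced.tolist())
```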

Why is s.replace so slow and how can performance be improved?

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit [d.get(i, i) for i in lst]                  # 243ms

Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.

Recommended answer

One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.

General case

  • Use df['A'].map(d) if all values mapped; or
  • Use df['A'].map(d).fillna(df['A']).astype(int) if >5% values mapped.

Few, e.g. < 5%, values in d

  • Use df['A'].replace(d)

The "crossover point" of ~5% is specific to the benchmarking below and will vary with the data and the pandas version.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
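
The rule of thumb above could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `replace_via_dict` and its `threshold` parameter are hypothetical, coverage is measured with `Series.isin` (which itself costs one pass over the data), and the result is cast back to the original dtype, which assumes the mapped values fit that dtype and that `d` never maps to NaN on purpose.

```python
import pandas as pd

def replace_via_dict(s, d, threshold=0.05):
    """Hypothetical helper: choose map or replace based on key coverage."""
    # Fraction of series values that appear among the dictionary keys
    coverage = s.isin(list(d)).mean()

    if coverage == 1.0:
        # Every value has a key: plain map is enough
        return s.map(d)
    if coverage > threshold:
        # Partial coverage: map, then restore unmapped values.
        # Assumption: mapped values are compatible with the original dtype.
        return s.map(d).fillna(s).astype(s.dtype)
    # Very few values are mapped: fall back to replace
    return s.replace(d)

s = pd.Series([0, 1, 2, 3])
full = replace_via_dict(s, {0: 1, 1: 2, 2: 3, 3: 4})
partial = replace_via_dict(s, {0: 9})
sparse = replace_via_dict(s, {0: 9}, threshold=0.5)
```

The 0.05 default simply mirrors the ~5% crossover reported in the answer; it would need re-measuring on other data.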

Benchmarking

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit df['A'].map(d)                              # 84.3ms
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 - Partial Map #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
%timeit [d.get(i, i) for i in lst]                  # 243ms
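
The `%timeit` lines above need IPython. As a rough sketch for a plain Python script, the partial-map case can be checked and timed with the standard `timeit` module. The smaller series size, the seed, and the names `via_replace`, `via_map`, `via_list` and `timings` are choices made here for illustration; no timings are quoted because they depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, 100_000))
d = {i: i + 1 for i in range(10)}  # partial map: only ~1% of values have keys

# First confirm that the three approaches produce the same values
via_replace = s.replace(d)
via_map = s.map(d).fillna(s).astype(int)
via_list = pd.Series([d.get(i, i) for i in s.tolist()], index=s.index)
agree = via_replace.tolist() == via_map.tolist() == via_list.tolist()

# Then time each approach (total seconds over 3 runs)
timings = {
    "replace": timeit.timeit(lambda: s.replace(d), number=3),
    "map+fillna": timeit.timeit(lambda: s.map(d).fillna(s).astype(int), number=3),
    "list comp": timeit.timeit(lambda: [d.get(i, i) for i in s.tolist()], number=3),
}

print(agree)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```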

Explanation

The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)

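There appear to be many steps involved:

  • Converting the dictionary to a list.
  • Iterating through the list and checking for nested dictionaries.
  • Feeding an iterator of keys and values into a replace function.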
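
The steps above can be imitated with public API only. The following is a sketch of the same unpacking, not the pandas internals themselves: it uses `dict.items()` in place of `compat.iteritems` and `pandas.api.types.is_dict_like` for the nested-dictionary check, then passes the keys and values to `replace` as two lists.

```python
import pandas as pd
from pandas.api.types import is_dict_like

s = pd.Series([1, 2, 3, 2])
d = {1: 10, 2: 20}

# Step 1: convert the dictionary to a list of (key, value) pairs
items = list(d.items())
keys, values = zip(*items)

# Step 2: check every value for a nested dictionary
are_mappings = [is_dict_like(v) for v in values]

# Step 3: with no nested dictionaries, hand keys and values to replace
via_lists = s.replace(list(keys), list(values))
via_dict = s.replace(d)

print(any(are_mappings))
print(via_lists.tolist() == via_dict.tolist())
```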

This can be compared to much leaner code from map() in pandas\series.py:

if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
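
The same lookup idea can be sketched with public API. `algos.take_1d` is a pandas internal, so this sketch swaps in plain NumPy indexing plus `np.where` to fill the misses; the names `lookup`, `taken` and `result` are illustrative. `Index.get_indexer` returns -1 for values that have no key, which is what becomes NaN in the output.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1])
d = {1: 10, 2: 20}

# Turn the dictionary into a Series: keys form the index, values the data
lookup = pd.Series(d)

# Position of each series value in the lookup index, or -1 if absent
indexer = lookup.index.get_indexer(s.to_numpy())

# Take the mapped values by position; positions marked -1 become NaN
taken = lookup.to_numpy()[indexer]
new_values = np.where(indexer == -1, np.nan, taken)

result = pd.Series(new_values, index=s.index)
print(indexer.tolist())
```

A single vectorised index lookup followed by a positional take is the whole job, which is consistent with map being the faster option when most values are covered.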
