有效地在字符串中进行多次多次替换 [英] Efficiently make many multiple substitutions in a string

查看:39
本文介绍了有效地在字符串中进行多次多次替换的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

人们之前已经解决了如何在基于字典的字符串中进行多次替换(参见,例如).似乎有一组基于 string.replace 的选项和一组基于正则表达式的选项,还有几个选项.

People have addressed before how to make multiple substitutions in a string based on a dictionary (see, for example). There seems to be a group of options based on string.replace and a group of options based on regular expressions, with a couple more options.

但我对取决于字典大小的不同方法的效率感兴趣,我发现这具有非常重要的影响.

But I am interested in the efficiency of the different methods depending on the size of the dictionary, which I found to have a very important impact.

my_subs = {'Hello world': 'apple', 'is': 'ship'}
string = 'Hello world! This is nice.'
new_string = my_efficient_method(string, my_subs)

需要明确的是,这个问题不是关于如何进行这些替换,而是关于哪种方法在哪些条件下更有效,以及哪些注意事项适用.特别是,当有很多(>100k)个长字符串(10-20k 个字符)并且字典很大(>80k 对)时,我正在寻找最实用的方法.在这些情况下,基于正则表达式的首选方法性能非常差.

To be clear, this question is not about how to make these substitutions, but about which method is more efficient in which conditions, and which caveats apply. In particular, I am looking for the most practical approach when there are many (>100k) strings that are long (10-20k characters) and the dictionary is huge (>80k pairs). Under these circumstances the preferred methods based on regular expressions perform very poorly.

推荐答案

如前所述,有不同的方法,每种方法都有不同的优势.我使用三种不同的情况进行比较.

As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.

  1. 简短字典(847 个替换对)
  2. 中型词典(2528 对)
  3. 长字典(80430 对)

对于字典 1 和 2(较短的),我将每种方法重复 50 次,以获得更一致的时间.对于较长的一个文档,单次通过需要足够长的时间(可悲的是).我使用在线 service tio 和 Python 3.8 测试了 1 和 2.长的在我的笔记本电脑上用 Python 3.6 测试过.只有方法之间的相对性能是相关的,所以次要的细节并不重要.

For dictionaries 1 and 2 (shorter ones) I repeat each method 50 times in a loop, to get a more consistent timing. With the longer one a single pass for one document takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8. The long one was tested in my laptop with Python 3.6. Only relative performance between methods is relevant, so the minor specifics are not important.

我的字符串在 28k 到 29k 个字符之间.

My string is between 28k and 29k characters.

所有时间以秒为单位.


一位同事发现了 Flashtext,一个 Python 库,专门用于此.它允许按查询搜索并应用替换.它比其他替代方案快大约两个数量级.在实验 3 中,我目前的最佳时间是 1.8 秒.Flashtext 需要 0.015 秒.

A colleague found Flashtext, a Python library that specializes precisely in this. It allows searching by query and also applying substitutions. It is about two orders of magnitude faster than other alternatives. In the experiment 3 my current best time was 1.8 seconds. Flashtext takes 0.015 seconds.


有很多变化,但最好的往往与此非常相似:

There are many variations, but the best tend to be very similar to this:

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)

执行时间为:

  1. 1.63
  2. 5.03
  3. 7.7


此方法只是在循环中应用 string.replace.(稍后我会谈到这方面的问题.)

This method simply applies string.replace in a loop. (Later I talk about problems with this.)

for original, replacement in self.my_dict.items():
    string = string.replace(original, replacement)

此解决方案提出了一种使用 reduce 的变体,它以迭代方式应用 Lambda 表达式.最好通过官方文档中的示例来理解这一点.表达式

This solution proposes a variation using reduce, that applies a Lambda expression iteratively. This is best understood with an example from the official documentation. The expression

reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])

等于 ((((1+2)+3)+4)+5)

equals ((((1+2)+3)+4)+5)

import functools
new_string = functools.reduce(lambda a, k: a.replace(*k), 
                              my_dict.items(), string)

Python 3.8 允许赋值表达式,如这个方法.在其核心,这也依赖于 string.replace.

Python 3.8 allows assignment expressions, as in this method. In its core this also relies on string.replace.

[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]

执行时间是(括号内为reduce和赋值表达式变体的结果):

Execution times were (in parenthesis results for reduce and assignment expressions variants):

  1. 1.37 (1.39) (1.50)
  2. 4.10 (4.12) (4.07)
  3. 1.9 (1.8)(机器中没有 Python 3.8)


该提议涉及使用递归 Lambda.

This proposal involves using a recursive Lambda.

mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)

执行时间为:

  1. 0.07
  2. 递归错误
  3. 递归错误


请参阅上面的更新:Flashtext 比其他替代方案快得多.

See the update above: Flashtext is much faster than the other alternatives.

从执行时间可以看出,递归方法显然是最快的,但它只适用于小字典.不建议在 Python 中过多地增加递归深度,因此对于较长的字典,这种方法完全被丢弃.

You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. It is not recommended to increase the recursion depth much in Python, so this approach is entirely discarded for longer dictionaries.

正则表达式提供了对替换的更多控制.例如,您可以在元素之前或之后使用 \b 以确保目标子字符串的该侧没有单词字符(以防止应用 {'a': '1'}'苹果').代价是较长字典的性能急剧下降,几乎是其他选项的四倍.

Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters at that side of the target substring (to prevent {'a': '1'} to be applied to 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as other options.

赋值表达式reduce 和简单循环replace 提供类似的性能(赋值表达式无法用更长的字典进行测试).考虑到可读性,string.replace 似乎是最好的选择.与正则表达式相比,这个问题的问题在于替换是按顺序发生的,而不是一次传递.所以 {'a': 'b', 'b': 'c'} 为字符串 'a' 返回 'c'.字典现在在 Python 中排序(但你可能想要 keep 使用OrderedDict) 以便您可以仔细设置替换顺序以避免出现问题.当然,对于 80k 次替换,您不能依赖于此.

Assignment expressions, reduce and simply looping replace offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass. So {'a': 'b', 'b': 'c'} returns 'c' for string 'a'. Dictionaries are now ordered in Python (but you may want to keep using OrderedDict) so you can set the order of substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.

我目前正在使用带替换的循环,并进行一些预处理以尽量减少麻烦.我在标点符号的两侧添加空格(也在包含标点符号的项目的字典中).然后我可以搜索由空格包围的子字符串,并插入带有空格的替换.当您的目标是多个词时,这也适用:

I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I am adding spaces at both sides of punctuation (also in the dictionary for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions with spaces as well. This also works when your targets are multiple words:

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

使用替换和正则表达式我得到 string = ' This is : an island ' 这样我的替换循环

Using replace and regular expressions I get string = ' This is : an island ' so that my replace loop

for original, replacement in self.my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')

返回 ' 这不是:博物馆 ' 预期.请注意,This"和island"中的is"被单独留下.正则表达式可用于修复标点符号,尽管我不需要这一步.

returns ' This is not : a museum ' as intended. Note that 'is' in 'This' and 'island' were left alone. Regular expressions could be used to fix punctuation back, although I don't require this step.

这篇关于有效地在字符串中进行多次多次替换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆