Most efficient way to remove multiple substrings from a string?

Question

What's the most efficient method to remove a list of substrings from a string?

I'd like a cleaner, quicker way to do the following:

words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']

def remove_multiple_strings(cur_string, replace_list):
  for cur_word in replace_list:
    cur_string = cur_string.replace(cur_word, '')
  return cur_string

remove_multiple_strings(words, replace_list)

Solution

Regex:

>>> import re
>>> re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
' word2  word4, '

The above one-liner is actually not as fast as your str.replace version, but it is definitely shorter:

>>> import hashlib, random  # Python 2 session: xrange, and sha1() accepts a str
>>> words = ' '.join([hashlib.sha1(str(random.random())).hexdigest()[:10] for _ in xrange(10000)])
>>> replace_list = words.split()[:1000]
>>> random.shuffle(replace_list)
>>> %timeit remove_multiple_strings(words, replace_list)
10 loops, best of 3: 49.4 ms per loop
>>> %timeit re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
1 loops, best of 3: 623 ms per loop

Gosh! More than 12x slower.

But can we improve it? Yes.

Since we are only concerned with words, we can simply match each word in the words string using \w+ and check it against a set built from replace_list (yes, an actual set: set(replace_list)):

>>> def sub(m):
    return '' if m.group() in s else m.group()
>>> %%timeit
s = set(replace_list)
re.sub(r'\w+', sub, words)
...
100 loops, best of 3: 7.8 ms per loop

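For even larger strings and word lists, the str.replace approach and my first solution will end up taking quadratic time, since their cost grows with both the length of the string and the number of words to remove, but this last solution should run in linear time.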
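
For readers on Python 3, the set-based approach might be packaged roughly as follows. This is only a sketch: the names remove_words and drop_if_listed are made up for illustration and are not part of the original answer. Note one behavioural difference it makes visible: because \w+ matches whole words, a listed word is not removed when it appears inside a longer word, whereas str.replace removes it anywhere.

```python
import re


def remove_words(text, words_to_remove):
    """Remove whole-word occurrences of words_to_remove from text in one pass.

    Hypothetical helper (not from the original answer): it wraps the
    re.sub(r'\\w+', callback, ...) idea so the set is built once per call.
    """
    lookup = set(words_to_remove)  # average O(1) membership test per word

    def drop_if_listed(match):
        # re.sub calls this for every \w+ run; return '' to delete the word,
        # or the word itself to keep it unchanged.
        word = match.group()
        return '' if word in lookup else word

    return re.sub(r'\w+', drop_if_listed, text)


words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']

result = remove_words(words, replace_list)
print(repr(result))  # ' word2  word4, '

# Whole-word matching: 'word10' is kept because it is not in the set,
# while str.replace would strip the 'word1' prefix and leave '0'.
partial = remove_words('word10 word1', ['word1'])
print(repr(partial))  # 'word10 '
print(repr('word10 word1'.replace('word1', '')))  # '0 '
```

If true substring removal is required (matches inside longer words), this sketch is not a drop-in replacement for the str.replace loop; it only fits the case the answer assumes, where the items to remove are whole words.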
