Remove small words using Python


Problem description

Is it possible to use a regex to remove small words from a text? For example, I have the following string (text):

anytext = " in the echo chamber from Ontario duo "

I would like to remove all words that are 3 characters or fewer. The result should be:

"echo chamber from Ontario"

Is it possible to do that using a regular expression or any other Python function?

Thanks.

Solution

Certainly, it's not that hard either:

import re
shortword = re.compile(r'\W*\b\w{1,3}\b')

The above expression selects any word that is between 1 and 3 characters long and ends on a word boundary, together with any non-word characters that precede it (typically whitespace, or nothing at all at the start of the string).

>>> shortword.sub('', anytext)
' echo chamber from Ontario '
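Note that the result above still carries a leading and a trailing space, whereas the string asked for in the question has none. A minimal self-contained sketch of one way to close that gap follows; the `remove_short_words` helper and the final `strip()` call are my own additions for illustration, not part of the original expression.

```python
import re

# Same pattern as above: optional non-word characters,
# then a whole word of 1 to 3 characters.
shortword = re.compile(r'\W*\b\w{1,3}\b')

def remove_short_words(text):
    # Hypothetical helper: drop the short words, then strip the
    # whitespace that is left over at both ends of the string.
    return shortword.sub('', text).strip()

anytext = " in the echo chamber from Ontario duo "
print(remove_short_words(anytext))  # echo chamber from Ontario
```

If you need the surrounding whitespace kept exactly as it was, leave out the `strip()`.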

The \b boundary matches are important here: they ensure that you don't match just the first or last 3 characters of a longer word.
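To see what the boundaries buy you, here is a small sketch comparing the pattern with and without \b. The sample string is chosen only for illustration and contains no short words, so nothing should be removed.

```python
import re

text = "echo chamber"

# With boundaries nothing is removed: both words are longer than 3 characters.
with_boundaries = re.sub(r'\b\w{1,3}\b', '', text)

# Without boundaries the pattern consumes every word in chunks of up to
# 3 characters, so only the space between the words survives.
without_boundaries = re.sub(r'\w{1,3}', '', text)

print(repr(with_boundaries))     # 'echo chamber'
print(repr(without_boundaries))  # ' '
```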

The \W* at the start lets you remove both the word and the preceding non-word characters so that the rest of the sentence still matches up. Note that punctuation is included in \W; use \s instead if you only want to remove preceding whitespace.
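The difference between \W and \s only shows up when punctuation sits in front of a short word. A quick sketch, using a made-up sample string and the hypothetical names `any_nonword` and `whitespace_only`:

```python
import re

# \W* also swallows punctuation in front of the short word ...
any_nonword = re.compile(r'\W*\b\w{1,3}\b')
# ... while \s* removes only the whitespace in front of it.
whitespace_only = re.compile(r'\s*\b\w{1,3}\b')

text = "wait, no; really"

print(any_nonword.sub('', text))      # wait; really
print(whitespace_only.sub('', text))  # wait,; really
```

Which one you want depends on whether the punctuation belongs to the word being removed or to the text around it.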

For what it's worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson's version collapses multiple whitespace characters into one space. Not sure if that matters to you.
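A short sketch of that whitespace difference, using the same split-and-join approach that appears as `lc_remove` in the timing session; the sample string with its run of three spaces is made up for illustration.

```python
import re

shortword = re.compile(r'\W*\b\w{1,3}\b')

text = "an echo   chamber"  # three spaces between the two long words

# The regex removes only the short word, so the run of spaces stays as it was.
regex_result = shortword.sub('', text)

# split() discards all whitespace and join() puts back single spaces.
split_result = ' '.join(word for word in text.split() if len(word) > 3)

print(repr(regex_result))  # ' echo   chamber'
print(repr(split_result))  # 'echo chamber'
```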

His list comprehension solution is the faster of the two, though only slightly in this measurement:

>>> import timeit
>>> def re_remove(text): return shortword.sub('', text)
... 
>>> def lc_remove(text): return ' '.join(word for word in text.split() if len(word)>3)
... 
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import re_remove as remove')
7.0774190425872803
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import lc_remove as remove')
6.4250049591064453
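If you want to repeat the comparison as a script rather than in an interactive session, something along these lines should work. It is only a sketch: the `number=100_000` setting is an arbitrary choice to keep the run short (the session above uses timeit's default of 1,000,000 calls), and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

shortword = re.compile(r'\W*\b\w{1,3}\b')

def re_remove(text):
    # Regex version: keeps the remaining whitespace untouched.
    return shortword.sub('', text)

def lc_remove(text):
    # Split-and-join version: collapses whitespace to single spaces.
    return ' '.join(word for word in text.split() if len(word) > 3)

sample = " in the echo chamber from Ontario duo "

# Passing a callable avoids the 'from __main__ import ...' setup string.
re_time = timeit.timeit(lambda: re_remove(sample), number=100_000)
lc_time = timeit.timeit(lambda: lc_remove(sample), number=100_000)

print(f"regex:      {re_time:.3f} s")
print(f"split/join: {lc_time:.3f} s")
```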
