Is there a way to remove duplicate and continuous words/phrases in a string?
Is there a way to remove duplicate and continuous words/phrases in a string? E.g.
[in]: foo foo bar bar foo bar
[out]: foo bar foo bar
I have tried this:
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
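For the single-word case, the same result can also be obtained with itertools.groupby, which groups runs of equal adjacent items, so the last word needs no special handling. A minimal sketch (dedupe_words is a name made up here for illustration, not part of the original attempt):

```python
from itertools import groupby


def dedupe_words(text):
    """Collapse runs of the same whitespace-separated word into one word."""
    # groupby yields one (key, group) pair per run of equal adjacent items,
    # so keeping only the keys drops the consecutive repeats.
    return " ".join(word for word, _ in groupby(text.split()))


s = ('this is a foo bar bar black sheep , have you any any wool woo , '
     'yes sir yes sir three bag woo wu wool')
print(dedupe_words(s))
```

This only handles single words; "yes sir yes sir" is left alone because no two adjacent tokens are equal.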
What happens when it gets a little more complicated and I want to remove phrases (let's say a phrase can be made up of up to 5 words)? How can it be done? E.g.
[in]: foo bar foo bar foo bar
[out]: foo bar
Another example:
[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .
[out]: this is a sentence where phrases duplicate . sentence are not phrases .
You can use the re module for that.
>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'
If you want to match any number of consecutive occurrences:
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'
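To see why the first pattern leaves one repeat behind, it can help to look at what each pattern actually matches. A small sketch, using only re.search and the two patterns above (the variable names single and repeated are made up here):

```python
import re

s = 'foo bar foo bar foo bar'

# One repeat only: the match covers the phrase and a single copy of it,
# so the third 'foo bar' is left in the string after substitution.
single = re.search(r'\b(.+)\s+\1\b', s)
print(repr(single.group(0)))    # text that re.sub would replace
print(repr(single.group(1)))    # text it is replaced with

# One or more repeats: the trailing (...)+ swallows every further copy.
repeated = re.search(r'\b(.+)(\s+\1\b)+', s)
print(repr(repeated.group(0)))
print(repr(repeated.group(1)))
```

In both cases group 1 is 'foo bar'; the difference is how much of the string the whole match covers.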
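Edit: an addition for your last example. A single pass is not enough there, because collapsing the repeated words is what makes the longer phrases line up as duplicates. So you have to keep calling re.sub for as long as the string still contains duplicate phrases: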
>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...     s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
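The loop above can be wrapped in a function, and the "up to 5 words" limit from the question can be built into the pattern. This is only a sketch under some assumptions: collapse_repeats and max_words are names invented here, max_words is assumed to be at least 1, and a "word" is taken to be a run of \w characters, so standalone punctuation tokens such as "," are never part of a phrase.

```python
import re


def collapse_repeats(text, max_words=5):
    """Repeatedly collapse consecutive repeats of phrases of up to max_words words."""
    # A phrase: one word followed by at most (max_words - 1) more words.
    phrase = r'\w+(?:\s+\w+){0,%d}' % (max_words - 1)
    pattern = re.compile(r'\b(' + phrase + r')(?:\s+\1\b)+')
    while True:
        # subn returns the new string and the number of substitutions made;
        # stop once a full pass changes nothing.
        text, count = pattern.subn(r'\1', text)
        if count == 0:
            return text


s = ('this is a sentence sentence sentence this is a sentence '
     'where phrases phrases duplicate where phrases duplicate')
print(collapse_repeats(s))
```

Note that repeating until nothing changes is more aggressive than a single pass: 'foo foo bar bar foo bar' first becomes 'foo bar foo bar' and is then collapsed again to 'foo bar'. If the first example in the question is the behaviour you want, stop after one pass instead.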