How to remove duplicate phrases in Python?


Problem description

Suppose I have a string such as

'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

I want to remove the second occurrence of the phrase "duplicate phrase" without removing other occurrences of its constituent words, such as the other use of "duplicate".

Moreover, I need to remove all potential duplicate phrases, not just duplicates of some specific phrase that I know in advance.

I have found several posts on similar problems, but none that helped me solve my particular issue:

I had hoped to adapt the approach from the last link there (re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)) for my purposes, but could not figure out how to do so.
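As a quick check of why that pattern does not help here, the sketch below applies it to the example string unchanged. The pattern only collapses repetitions that are separated by whitespace alone, so the comma after the first "duplicate phrase" prevents a match; the variable name `linked_result` is just for illustration.

```python
import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

# Pattern from the linked post: a run of text followed by one or more
# whitespace-separated repeats of the same text.
linked_result = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)

# The repeats here are separated by ', ' rather than by whitespace only,
# so nothing is replaced and the string comes back as it went in.
print(linked_result == s)
```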

How can I remove all arbitrary duplicate phrases of two or more words from a string in Python?

Recommended answer

Thanks, everyone, for your attempts and comments. I have finally found a solution:

import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
result = re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags=re.I)
print(result)
# I hate *some* kinds of duplicate. This string has a duplicate phrase.
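For repeated use, the same substitution could be wrapped in a small helper. This is only a sketch: the names `DUPLICATE_PHRASE` and `remove_duplicate_phrases` are hypothetical and not part of the original answer, and the pattern is the one shown above, compiled once.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused.
DUPLICATE_PHRASE = re.compile(r'((\b\w+\b.{1,2}\w+\b)+).+\1', flags=re.I)


def remove_duplicate_phrases(text):
    """Replace each 'phrase ... phrase' match with the first occurrence of the phrase."""
    return DUPLICATE_PHRASE.sub(r'\1', text)


s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
print(remove_duplicate_phrases(s))

# Because of re.I, a repeat that differs only in case is also collapsed,
# and the casing of the first occurrence is the one that is kept.
print(remove_duplicate_phrases('Duplicate phrase is bad. duplicate phrase is bad.'))

# Text without a repeated phrase is returned unchanged.
print(remove_duplicate_phrases('No repeats in this sentence.'))
```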

Explanation

The regex

r'((\b\w+\b.{1,2}\w+\b)+).+\1'

matches a phrase made of runs of word characters (\w+) separated by one or two arbitrary characters (to cover the case where words are separated not just by a space, but perhaps by a period or comma and a space), followed by a run of arbitrary characters of indeterminate length (.+), and then by a repeat of the same phrase (\1). Then

re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)

replaces each such match with the captured phrase (group 1), being sure to ignore case (since the duplicate phrase could sometimes occur at the beginning of a sentence).
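To make the pieces of the match visible, the sketch below inspects the match object for the example string and then shows one consequence of the greedy .+ in the middle of the pattern. The second string, `t`, is an invented example and not from the original post.

```python
import re

pattern = re.compile(r'((\b\w+\b.{1,2}\w+\b)+).+\1', flags=re.I)
s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

m = pattern.search(s)
print(m.group(1))  # the captured phrase: duplicate phrase
print(m.group(0))  # the whole match: duplicate phrase, duplicate phrase

# The .+ between the phrase and its repeat is part of the match, so any
# text sitting between the two occurrences is removed by the substitution.
t = 'red car, blue sky, red car'
print(pattern.sub(r'\1', t))  # red car
```

In other words, the substitution works cleanly when the repeat follows the phrase closely, as in the example string, but it may discard unrelated text when the two occurrences are far apart.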
