Removing consecutive duplicate words in a string
Question
I am trying to write a function that removes consecutive duplicate words within a string. It's vital that one copy of any match found by the regular expression remains. In other words...
A very very very dirty dog
should become...
A very dirty dog
I have a regular expression that seems to work well (based on this post):
(\b\S+\b)(($|\s+)\1)+
However, I'm not sure how to use preg_replace (or whether there's a better function) to implement this. Right now it deletes all of the matching repeated words without leaving one copy of the word intact. Can I pass a variable or special instruction to it so that one match is kept?
I currently have this...
$string=preg_replace('/(\b\S+\b)(($|\s+)\1)+/', '', $string);
Recommended answer
You may use a regex like \b(\S+)(?:\s+\1\b)+ and replace with $1:
$string=preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $string);
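For reuse, the call above could be wrapped in a small helper. This is only a sketch: the function name collapse_repeated_words is made up for illustration, and falling back to the input when preg_replace returns null is an assumption about how errors should be handled, not something stated in the answer.

```php
<?php
// Hypothetical helper name; the pattern and the '$1' replacement are the ones from the answer.
function collapse_repeated_words(string $text): string
{
    $result = preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $text);

    // preg_replace() returns null on failure; returning the input unchanged
    // in that case is an assumption made for this sketch.
    return $result ?? $text;
}

echo collapse_repeated_words('A very very very dirty dog'), PHP_EOL; // A very dirty dog
```

Replacing with '$1' rather than the empty string is what keeps one copy of the repeated word.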
See the regex demo.
Details:
\b(\S+) - Group 1, capturing one or more non-whitespace symbols that are preceded by a word boundary (maybe \b(\w+) would suit better here)
(?:\s+\1\b)+ - 1 or more sequences of:
    \s+ - 1 or more whitespaces
    \1\b - a backreference to the value stored in the Group 1 buffer, followed by a word boundary (so the repeated value must be a whole word)
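To make the breakdown above concrete, here is a rough comparison of the pattern variants. The sample strings and variable names are invented for illustration; the first pair shows how \S+ and \w+ treat a hyphenated word, and the second shows what the trailing \b after \1 guards against.

```php
<?php
// Sample input chosen for illustration only.
$input = 'foo-bar foo-bar baz baz';

// \S+ can capture the whole token "foo-bar", so the repeat is collapsed.
$withNonSpace = preg_replace('/\b(\S+)(?:\s+\1\b)+/', '$1', $input);

// \w+ stops at the hyphen, so "foo-bar foo-bar" is never seen as a repeat.
$withWordChars = preg_replace('/\b(\w+)(?:\s+\1\b)+/', '$1', $input);

echo $withNonSpace, PHP_EOL;  // foo-bar baz
echo $withWordChars, PHP_EOL; // foo-bar foo-bar baz

// Without the trailing \b, the backreference may match a prefix of a longer word.
$noBoundary   = preg_replace('/\b(\S+)(?:\s+\1)+/', '$1', 'the theme');
$withBoundary = preg_replace('/\b(\S+)(?:\s+\1\b)+/', '$1', 'the theme');

echo $noBoundary, PHP_EOL;   // theme
echo $withBoundary, PHP_EOL; // the theme
```

Which of \S+ or \w+ is the better fit depends on whether tokens containing punctuation should count as words in your data.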
The replacement pattern is $1, the replacement backreference that refers to the value stored in the Group 1 buffer.
Note that the /i case-insensitive modifier makes \1 case insensitive too, so I have a dog Dog DOG will result in I have a dog.
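The effect of the /i modifier can be checked with a short side-by-side run. The variable names are illustrative; the input is the example sentence from the note.

```php
<?php
$input = 'I have a dog Dog DOG';

// With /i, the backreference \1 also matches "Dog" and "DOG",
// and the first occurrence ("dog") is the one that is kept.
$insensitive = preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $input);

// Without /i, the differently cased words are not treated as repeats.
$sensitive = preg_replace('/\b(\S+)(?:\s+\1\b)+/', '$1', $input);

echo $insensitive, PHP_EOL; // I have a dog
echo $sensitive, PHP_EOL;   // I have a dog Dog DOG
```

If the casing of the kept word matters, drop /i so that only exact repeats are collapsed.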