Removing consecutive duplicate words from text using regex and displaying the new text
Question
Hi,
I have the following code:
import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;

public class RegexSimple4
{
    public static void main(String[] args) {
        try {
            Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
            ArrayList<String> foundaz = new ArrayList<String>();
            ArrayList<String> noduplicates = new ArrayList<String>();
            while (myfis.hasNext()) {
                String line = myfis.nextLine();
                String delim = " ";
                String[] words = line.split(delim);
                for (String s : words) {
                    if (!s.isEmpty() && s != null) {
                        Pattern pi = Pattern.compile("[aA-zZ]*");
                        Matcher ma = pi.matcher(s);
                        if (ma.find()) {
                            foundaz.add(s);
                        }
                    }
                }
            }
            if (foundaz.isEmpty()) {
                System.out.println("No words have been found");
            }
            if (!foundaz.isEmpty()) {
                int n = foundaz.size();
                String plus = foundaz.get(0);
                noduplicates.add(plus);
                for (int i = 1; i < n; i++) {
                    if (!noduplicates.get(i-1).equalsIgnoreCase(foundaz.get(i))) {
                        noduplicates.add(foundaz.get(i));
                    }
                }
                //System.out.print("Cuvantul/cuvintele \n"+i);
            }
            if (!foundaz.isEmpty()) {
                System.out.print("Original text \n");
                for (String s : foundaz) {
                    System.out.println(s);
                }
            }
            if (!noduplicates.isEmpty()) {
                System.out.print("Remove duplicates\n");
                for (String s : noduplicates) {
                    System.out.println(s);
                }
            }
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
The goal is to remove consecutive duplicate words from phrases. The code works only for a column of strings, not for full-length phrases.
For example, my input would be:
Blah blah dog cat mice. Cat mice dog dog.
And the output:
Blah dog cat mice. Cat mice dog.
Sincerely,
Recommended answer
First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more characters, each of which is an a, a character in the range between ASCII A and ASCII z (which also includes [, \, ], ^, _ and `), or a Z". Because of the * quantifier it also matches the empty string, so the ma.find() check in your code succeeds for every token.
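To make the point about the character class concrete, here is a small sketch that compares the pattern from the question with a stricter one. The class name CharClassDemo, its helper methods, and the stricter pattern [A-Za-z]+ are illustrative choices, not part of the original code.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The pattern from the question: 'a', the range 'A'..'z', and 'Z', zero or more times.
    static final Pattern LOOSE = Pattern.compile("[aA-zZ]*");
    // A stricter alternative (an assumption about the intent): one or more ASCII letters.
    static final Pattern STRICT = Pattern.compile("[A-Za-z]+");

    static boolean looseMatches(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean looseFinds(String s) {
        return LOOSE.matcher(s).find();
    }

    static boolean strictMatches(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The range A-z covers code points 65..122, which includes [ \ ] ^ _ `
        System.out.println(looseMatches("[_]"));   // true
        // The * quantifier allows zero characters, so the empty string matches
        System.out.println(looseMatches(""));      // true
        // find() succeeds with an empty match even when there are no letters at all
        System.out.println(looseFinds("123"));     // true

        System.out.println(strictMatches("[_]"));  // false
        System.out.println(strictMatches(""));     // false
        System.out.println(strictMatches("Blah")); // true
    }
}
```

With the loose pattern, the find() call in the question never filters anything out, which is why every token ends up in the list.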
Assuming that you are only looking for duplicate words that consist solely of ASCII letters, compared case-insensitively, and that you want to keep the first occurrence (which means that you wouldn't want to match "it's it's" or "olé olé!"), you can do that in a single regex operation:
String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
This will change
Hello hello Hello there there past pastures
into
Hello there past pastures
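As a quick check, the same replacement can be wrapped in a helper and run against both this example and the sample input from the question. This is a minimal sketch: the class name DedupeWords and the method removeConsecutiveDuplicates are hypothetical, and the pattern is precompiled with Pattern.CASE_INSENSITIVE instead of the inline (?i) flag, which should behave the same way here.

```java
import java.util.regex.Pattern;

public class DedupeWords {
    // Same expression as in the one-liner above, compiled once for reuse.
    private static final Pattern REPEATED_WORD =
        Pattern.compile("\\b([a-z]+)\\b(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collapses each run of repeated words to its first occurrence.
    static String removeConsecutiveDuplicates(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeConsecutiveDuplicates(
            "Hello hello Hello there there past pastures"));
        // prints: Hello there past pastures

        System.out.println(removeConsecutiveDuplicates(
            "Blah blah dog cat mice. Cat mice dog dog."));
        // prints: Blah dog cat mice. Cat mice dog.
    }
}
```

Because the replacement works on the whole string, punctuation and spacing outside the repeated runs are left untouched, so a file can be processed line by line without splitting it into words first.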
Explanation:
(?i) # Mode: case-insensitive
\b # Match the start of a word
([a-z]+) # Match one ASCII "word", capture it in group 1
\b # Match the end of a word
(?: # Start of non-capturing group:
\s+ # Match at least one whitespace character
\1 # Match the same word as captured before (case-insensitively)
\b # and make sure it ends there.
)+ # Repeat that as often as possible
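If you want to keep this commented layout in the source code itself, Java's Pattern.COMMENTS flag ignores whitespace in the pattern and treats # as the start of a comment. The sketch below is one possible way to write it; the class name CommentedRegex and the method dedupe are made up for illustration.

```java
import java.util.regex.Pattern;

public class CommentedRegex {
    // The annotated regex, compiled in free-spacing (COMMENTS) mode.
    // Each line ends with \n so that the # comment stops at the end of that line.
    static final Pattern REPEATED_WORD = Pattern.compile(
        "\\b          # Match the start of a word\n" +
        "([a-z]+)     # Match one ASCII word, capture it in group 1\n" +
        "\\b          # Match the end of a word\n" +
        "(?:          # Start of non-capturing group:\n" +
        "  \\s+       # Match at least one whitespace character\n" +
        "  \\1        # Match the same word as captured before\n" +
        "  \\b        # and make sure it ends there\n" +
        ")+           # Repeat that as often as possible\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: keeps the first word of every repeated run.
    static String dedupe(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("Hello hello Hello there there past pastures"));
        // prints: Hello there past pastures
    }
}
```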
See it live on regex101.com.