Removing consecutive duplicate words from text using regex and displaying the new text

Question

Hi,

I have the following code:

import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;

public  class RegexSimple4
{

     public static void main(String[] args) {   

          try {
              Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
              ArrayList <String> foundaz = new ArrayList<String>();
              ArrayList <String> noduplicates = new ArrayList<String>();

              while(myfis.hasNext()) {
                  String line = myfis.nextLine();
                  String delim = " ";
                  String [] words = line.split(delim);

                  for (String s : words) {                    
                      if (!s.isEmpty() && s != null) {
                          Pattern pi = Pattern.compile("[aA-zZ]*");
                          Matcher ma = pi.matcher(s);

                          if (ma.find()) {
                              foundaz.add(s);
                          }
                      }
                  }
              }

              if(foundaz.isEmpty()) {
                  System.out.println("No words have been found");
              }

              if(!foundaz.isEmpty()) {
                  int n = foundaz.size();
                  String plus = foundaz.get(0);
                  noduplicates.add(plus);
                  for(int i=1; i<n; i++) {   
                      if ( !noduplicates.get(i-1) .equalsIgnoreCase(foundaz.get(i))) {
                          noduplicates.add(foundaz.get(i));
                      }
                  }

                  //System.out.print("Cuvantul/cuvintele \n"+i);

              }
              if(!foundaz.isEmpty()) { 
                  System.out.print("Original text \n");
                  for(String s: foundaz) {
                      System.out.println(s);
                  }
              }
              if(!noduplicates.isEmpty()) {
                  System.out.print("Remove duplicates\n");
                  for(String s: noduplicates) {
                      System.out.println(s);
                  }
              }

          } catch(Exception ex) {
              System.out.println(ex); 
          }
      }
  }

The goal is to remove consecutive duplicate words from phrases. The code works only for a column of single words, not for full-length phrases.

For example, my input would be:

Blah blah dog cat mice. Cat mice dog dog.

And the output:

Blah dog cat mice. Cat mice dog.

Sincerely,

Recommended answer

First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more characters, each of which is an a, a character in the range from ASCII A to ASCII z (which also includes [, \, ], ^, _ and `), or a Z". Because of the *, it also matches the empty string, so ma.find() succeeds for every token, whatever it contains.
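To see the difference in practice, here is a minimal sketch that compares the pattern from the question with [a-zA-Z]+, which is assumed here to be what was intended. The class and method names are illustrative and do not come from the question.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Pattern from the question: 'a', the range 'A'..'z', or 'Z', zero or more times.
    static final Pattern QUESTION_PATTERN = Pattern.compile("[aA-zZ]*");

    // Assumed intent (not stated in the question): one or more ASCII letters.
    static final Pattern LETTERS_ONLY = Pattern.compile("[a-zA-Z]+");

    // Mirrors the question's check: find() succeeds on any input,
    // because the pattern can match the empty string.
    static boolean findsWithQuestionPattern(String s) {
        return QUESTION_PATTERN.matcher(s).find();
    }

    // matches() requires the whole token to consist of ASCII letters.
    static boolean isAsciiWord(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"dog", "[\\]^_", "123", ""}) {
            System.out.println("\"" + token + "\": find=" + findsWithQuestionPattern(token)
                    + ", letters only=" + isAsciiWord(token));
        }
    }
}
```

With find() and the original pattern, every token is accepted, including "123" and the empty string. Combining matches() with + instead of * is one way to accept only purely alphabetic tokens.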

Assuming that you are only looking for duplicate words that consist solely of ASCII letters, compared case-insensitively, and that you want to keep the first occurrence (which means that you wouldn't want to match "it's it's" or "olé olé!"), you can do that in a single regex operation:

String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");

This will change

Hello hello Hello there there past pastures

into

Hello there past pastures
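For completeness, here is a small self-contained sketch that wraps the replaceAll call in a method and runs it on both the sample above and the input from the question. The class name DedupeWords and the method name removeConsecutiveDuplicates are made up for this illustration.

```java
public class DedupeWords {
    // The regex from the answer, unchanged.
    static final String DUPLICATE_WORDS = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";

    // Replaces each run of a repeated word with its first occurrence.
    static String removeConsecutiveDuplicates(String text) {
        return text.replaceAll(DUPLICATE_WORDS, "$1");
    }

    public static void main(String[] args) {
        System.out.println(removeConsecutiveDuplicates(
                "Hello hello Hello there there past pastures"));
        System.out.println(removeConsecutiveDuplicates(
                "Blah blah dog cat mice. Cat mice dog dog."));
    }
}
```

This prints "Hello there past pastures" followed by "Blah dog cat mice. Cat mice dog.". Because the replacement works on whole strings, each line read from the file can be passed through the method directly, without first splitting it into words.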

Explanation:

(?i)     # Mode: case-insensitive
\b       # Match the start of a word
([a-z]+) # Match one ASCII "word", capture it in group 1
\b       # Match the end of a word
(?:      # Start of non-capturing group:
 \s+     # Match at least one whitespace character
 \1      # Match the same word as captured before (case-insensitively)
 \b      # and make sure it ends there.
)+       # Repeat that as often as possible
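The breakdown above can also be kept next to the pattern in the source code. The sketch below assumes you are willing to compile the pattern with Pattern.COMMENTS, which makes the regex engine ignore whitespace and #-comments inside the pattern, and it passes Pattern.CASE_INSENSITIVE as a flag in place of the inline (?i). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class CommentedRegex {
    // Same regex as in the answer, written in free-spacing mode.
    // Each fragment ends with \n so that the # comment stops at the end of its line.
    static final Pattern DUPLICATE_WORDS = Pattern.compile(
            "\\b        # start of a word\n"
          + "([a-z]+)   # one ASCII word, captured in group 1\n"
          + "\\b        # end of the word\n"
          + "(?:        # start of non-capturing group\n"
          + "  \\s+     # at least one whitespace character\n"
          + "  \\1      # the same word as captured before\n"
          + "  \\b      # and make sure it ends there\n"
          + ")+         # repeat as often as possible\n",
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // Collapses each run of a repeated word to its first occurrence.
    static String collapse(String text) {
        return DUPLICATE_WORDS.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("dog  Dog\tDOG cat")); // repeats separated by mixed whitespace
        System.out.println(collapse("past pastures"));     // \b prevents a partial-word match
        System.out.println(collapse("it's it's"));         // not a pure ASCII-letter word
    }
}
```

Compiling the pattern once into a static Pattern also avoids recompiling it for every line, which String.replaceAll would otherwise do on each call.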

See it live on regex101.com.
