Using a regular expression to remove consecutive duplicate words from text and display the new text

Z时代
2024-01-10
Category: Q&A

I have the following code:

import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public  class RegexSimple4
{
     public static void main(String[] args) {
          try {
              Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
              ArrayList <String> foundaz = new ArrayList<String>();
              ArrayList <String> noduplicates = new ArrayList<String>();
              while(myfis.hasNext()) {
                  String line = myfis.nextLine();
                  String delim = " ";
                  String [] words = line.split(delim);
                  for (String s : words) {                    
                      if (!s.isEmpty() && s != null) {
                          Pattern pi = Pattern.compile("[aA-zZ]*");
                          Matcher ma = pi.matcher(s);
                          if (ma.find()) {
                              foundaz.add(s);
                          }
                      }
                  }
              }
              if(foundaz.isEmpty()) {
                  System.out.println("No words have been found");
              }
              if(!foundaz.isEmpty()) {
                  int n = foundaz.size();
                  String plus = foundaz.get(0);
                  noduplicates.add(plus);
                  for(int i=1; i<n; i++) {   
                      if ( !noduplicates.get(i-1) .equalsIgnoreCase(foundaz.get(i))) {
                          noduplicates.add(foundaz.get(i));
                      }
                  }
                  //System.out.print("Cuvantul/cuvintele \n"+i);
              }
              if(!foundaz.isEmpty()) { 
                  System.out.print("Original text \n");
                  for(String s: foundaz) {
                      System.out.println(s);
                  }
              }
              if(!noduplicates.isEmpty()) {
                  System.out.print("Remove duplicates\n");
                  for(String s: noduplicates) {
                      System.out.println(s);
                  }
              }
          } catch(Exception ex) {
              System.out.println(ex); 
          }
      }
  }

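The goal is to remove consecutive duplicates from a phrase. The code only works on a column of strings (one word per line), not on full-length phrases.

For example, my input should be: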

Blah blah dog cat mice. Cat mice dog dog.

and the output:

Blah dog cat mice. Cat mice dog.

Sincerely

Answer:

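First, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more a's, or characters in the range between ASCII A and ASCII z (which also includes [, ], \ and others), or Z's". It therefore also matches the empty string.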
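A minimal sketch of that pitfall, for illustration only. The class name CharClassDemo and the replacement pattern [a-zA-Z]+ are assumptions, not part of the original code; the replacement is only one guess at what was intended.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // The pattern from the question: 'a', the range 'A'..'z', and 'Z', zero or more times.
    static final Pattern QUESTION_PATTERN = Pattern.compile("[aA-zZ]*");
    // Assumed intent: at least one ASCII letter.
    static final Pattern LETTERS_ONLY = Pattern.compile("[a-zA-Z]+");

    public static void main(String[] args) {
        // The empty string matches, because * allows zero repetitions.
        System.out.println(QUESTION_PATTERN.matcher("").matches());      // true
        // '[', ']', '^', '_' and '`' sit between 'Z' and 'a' in ASCII, so they match too.
        System.out.println(QUESTION_PATTERN.matcher("[]^_`").matches()); // true
        // find() succeeds on any input by matching the empty string at position 0,
        // which is why the question's if (ma.find()) never filters anything out.
        System.out.println(QUESTION_PATTERN.matcher("123").find());      // true
        // The stricter pattern rejects input that contains no letters.
        System.out.println(LETTERS_ONLY.matcher("123").find());          // false
    }
}
```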

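Assuming you are only looking for repeated words that consist solely of ASCII letters, compared case-insensitively and keeping the first occurrence (which means you don't want to match "it's it's" or "olé olé!"), you can do this in a single regex operation: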

String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");

This will change

Hello hello Hello there there past pastures

into

Hello there past pastures
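As a rough sketch, the one-liner can be wrapped in a small helper and tried on whole sentences. The class name RepeatedWords and the method name collapse are hypothetical, and the second sample sentence is an assumed English form of the input from the question.

```java
import java.util.regex.Pattern;

final class RepeatedWords {
    // Same regex as in the answer, compiled once so it can be reused for every line.
    private static final Pattern REPEATED =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Replaces each run of a repeated word with its first occurrence.
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: Hello there past pastures
        System.out.println(collapse("Hello hello Hello there there past pastures"));
        // prints: Blah dog cat mice. Cat mice dog.
        System.out.println(collapse("Blah blah dog cat mice. Cat mice dog dog."));
        // Left unchanged, because the apostrophe splits each token into "it" and "s".
        System.out.println(collapse("it's it's"));
    }
}
```

To process a file as in the question, each line read by the Scanner could be passed through collapse before printing, instead of splitting it into words first.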

(?i)        # Mode: case-insensitive
\b          # Match the start of a word
([a-z]+)    # Match one ASCII "word", capture it in group 1
\b          # Match the end of a word
(?:         # Start of non-capturing group:
 \s+        # Match at least one whitespace character
 \1         # Match the same word as captured before (case-insensitively)
 \b         # and make sure it ends there.
)+          # Repeat that as often as possible
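The commented form of the regex can also be kept in Java source by compiling it with Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment that runs to the end of the line. This is one possible layout, not the answer's own code; the class name CommentedRegex and the method name collapse are hypothetical.

```java
import java.util.regex.Pattern;

final class CommentedRegex {
    // Flags replace the inline (?i); COMMENTS allows the annotated layout.
    static final Pattern REPEATED = Pattern.compile(
            String.join("\n",
                    "\\b         # start of a word",
                    "([a-z]+)    # one ASCII word, captured in group 1",
                    "\\b         # end of that word",
                    "(?:         # start of non-capturing group",
                    "  \\s+      #   at least one whitespace character",
                    "  \\1       #   the same word again, ignoring case",
                    "  \\b       #   ending at a word boundary",
                    ")+          # repeat as often as possible"),
            Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    // Keeps only the first word of each run of repeats.
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: Hello there past pastures
        System.out.println(collapse("Hello hello Hello there there past pastures"));
    }
}
```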

See it live on regex101.com.

Source link: utcz.com/qa/409590.html