Using a regular expression to remove consecutive duplicate words from text and display the new text

Hi,

I have the following code:

import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;

public class RegexSimple4
{
    public static void main(String[] args) {
        try {
            Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
            ArrayList<String> foundaz = new ArrayList<String>();
            ArrayList<String> noduplicates = new ArrayList<String>();

            while (myfis.hasNext()) {
                String line = myfis.nextLine();
                String delim = " ";
                String[] words = line.split(delim);
                for (String s : words) {
                    if (!s.isEmpty() && s != null) {
                        Pattern pi = Pattern.compile("[aA-zZ]*");
                        Matcher ma = pi.matcher(s);
                        if (ma.find()) {
                            foundaz.add(s);
                        }
                    }
                }
            }

            if (foundaz.isEmpty()) {
                System.out.println("No words have been found");
            }

            if (!foundaz.isEmpty()) {
                int n = foundaz.size();
                String plus = foundaz.get(0);
                noduplicates.add(plus);
                for (int i = 1; i < n; i++) {
                    if (!noduplicates.get(i - 1).equalsIgnoreCase(foundaz.get(i))) {
                        noduplicates.add(foundaz.get(i));
                    }
                }
                //System.out.print("Cuvantul/cuvintele \n"+i);
            }

            if (!foundaz.isEmpty()) {
                System.out.print("Original text \n");
                for (String s : foundaz) {
                    System.out.println(s);
                }
            }

            if (!noduplicates.isEmpty()) {
                System.out.print("Remove duplicates\n");
                for (String s : noduplicates) {
                    System.out.println(s);
                }
            }
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}

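The goal is to remove consecutive duplicates from a phrase. The code works only for a column of strings (one word per line), not for full-length phrases.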

For example, my input should be:

Blah blah dog cat mice. Cat mice dog dog.

and the output:

Blah dog cat mice. Cat mice dog.

Sincerely,

Answer:

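First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more a's, or characters in the range between ASCII A and ASCII z (which also includes [, ], \ and others), or Z's". Because of the *, it also matches the empty string.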
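To make the point above concrete, here is a small sketch comparing [aA-zZ]* with a class that only accepts ASCII letters. The class name CharClassDemo, its helper methods and the [A-Za-z]+ alternative are illustrative assumptions, not part of the original question.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The pattern from the question: 'a', the range 'A'..'z', and 'Z', zero or more times.
    static final Pattern ORIGINAL = Pattern.compile("[aA-zZ]*");
    // Hypothetical stricter alternative: one or more ASCII letters.
    static final Pattern LETTERS = Pattern.compile("[A-Za-z]+");

    // True if the question's pattern finds a match anywhere in s (possibly an empty one).
    static boolean originalFinds(String s) {
        return ORIGINAL.matcher(s).find();
    }

    // True if the stricter pattern finds at least one ASCII letter in s.
    static boolean lettersFinds(String s) {
        return LETTERS.matcher(s).find();
    }

    public static void main(String[] args) {
        // find() succeeds with an empty match, so even digits "match".
        System.out.println(originalFinds("123"));          // true
        System.out.println(lettersFinds("123"));           // false
        // The range A-z also covers the punctuation between 'Z' and 'a'.
        System.out.println("[]^_".matches("[aA-zZ]*"));    // true
        System.out.println("[]^_".matches("[A-Za-z]+"));   // false
    }
}
```

This is why the `if (ma.find())` test in the question accepts every non-empty token: the pattern can always match the empty string.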

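Assuming that you are only looking for duplicate words that consist solely of ASCII letters, compared case-insensitively, and that you want to keep the first occurrence (which means you wouldn't want to match "it's it's" or "olé olé!"), then you can do that in a single regex operation: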

String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");

This will change

Hello hello Hello there there past pastures

into

Hello there past pastures
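As a runnable sketch of the one-liner above, the following wraps it in a small class and applies it both to this sample and to the phrase from the question. The class name DedupDemo and the method name removeConsecutiveDuplicates are made up for illustration; reading the text from a file, as the question does, is left out.

```java
public class DedupDemo {
    // Collapses runs of the same ASCII-letter word (ignoring case) to the first occurrence.
    static String removeConsecutiveDuplicates(String subject) {
        return subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        // "past pastures" is left alone because \b requires the repeated word to end.
        System.out.println(removeConsecutiveDuplicates(
                "Hello hello Hello there there past pastures"));
        // prints: Hello there past pastures

        // The phrase from the question, with punctuation after the duplicates.
        System.out.println(removeConsecutiveDuplicates(
                "Blah blah dog cat mice. Cat mice dog dog."));
        // prints: Blah dog cat mice. Cat mice dog.
    }
}
```

Because the replacement works on the whole string, there is no need to split the line into words first, so punctuation and spacing around the surviving words are preserved.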

Explanation:

(?i)      # Mode: case-insensitive
\b        # Match the start of a word
([a-z]+)  # Match one ASCII "word", capture it in group 1
\b        # Match the end of a word
(?:       # Start of non-capturing group:
  \s+     # Match at least one whitespace character
  \1      # Match the same word as captured before (case-insensitively)
  \b      # and make sure it ends there.
)+        # Repeat that as often as possible
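If you would like to keep that commented breakdown in the Java source itself, one option is the Pattern.COMMENTS flag, which makes the regex engine ignore whitespace and # comments inside the pattern. The sketch below assumes a hypothetical class VerboseRegexDemo with a collapse method; the flags replace the inline (?i).

```java
import java.util.regex.Pattern;

public class VerboseRegexDemo {
    // Same pattern as the one-liner, written out with comments (COMMENTS mode).
    static final Pattern CONSECUTIVE_DUPLICATES = Pattern.compile(
            "\\b            # start of a word\n"
          + "([a-z]+)       # one ASCII word, captured in group 1\n"
          + "\\b            # end of the word\n"
          + "(?:            # start of non-capturing group\n"
          + "  \\s+         # at least one whitespace character\n"
          + "  \\1          # the same word again\n"
          + "  \\b          # and it must end there\n"
          + ")+             # repeat as often as possible\n",
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // Replaces each run of repeated words with its first occurrence.
    static String collapse(String text) {
        return CONSECUTIVE_DUPLICATES.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Hello hello Hello there there past pastures"));
        // prints: Hello there past pastures
    }
}
```

Compiling the pattern once into a static field also avoids recompiling it on every call, which the replaceAll one-liner does implicitly.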

See it live on regex101.com.
