Splitting text into paragraphs with regex in Java


Question

I have a text file that contains some data. All paragraphs start with four spaces. My aim is to split this text into paragraphs.

First, I read the whole text using:

    public String parseToString(String filePath) throws IOException {
        return new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
    }

Then I use this code to split the string:

    private static final String PARAGRAPH_SPLIT_REGEX = "(^\\s{4})";
    public void parseText(String text) {
        String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
        for (int i = 0; i < paragraphs.length; i++) {
            System.out.println("Paragraph: " + paragraphs[i]);
        }
    }

My input file is:

    Hello, World!
    Hello, World!

The output is:

Paragraph: 
Paragraph: Hello, World!
    Hello, World!

What am I doing wrong?

Recommended answer

By default, ^ matches the start of the string, not the start of a line. If you want it to match the start of each line, add the multiline flag (?m) to your regex.
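The effect of the flag can be checked with a small sketch. The class name MultilineAnchorDemo and the helper countMatches are made up for illustration; the sample text mirrors the input file from the question and assumes \n line endings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchorDemo {
    // Hypothetical helper: counts how many times the regex matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two paragraphs, each starting with four spaces (assumed sample input).
        String text = "    Hello, World!\n    Hello, World!";

        // Without (?m), ^ can only match at the very start of the string.
        System.out.println(countMatches("^\\s{4}", text));

        // With (?m), ^ also matches right after each line terminator.
        System.out.println(countMatches("(?m)^\\s{4}", text));
    }
}
```

For this input the first call finds a single match and the second finds two, which is why the original regex never splits at the second paragraph.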

Also consider using a look-ahead. A look-ahead is a zero-width match, and since Java 8 a zero-width match at the beginning of the input never produces an empty leading element in the array returned by split.

So try this regex:

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";
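To see what the look-ahead changes, the following sketch compares a consuming pattern with the look-ahead version on the same input. The class name LookaheadSplitDemo and the constant names are assumptions made for this example, and the behavior described applies to Java 8 or later.

```java
import java.util.Arrays;

class LookaheadSplitDemo {
    // Consuming pattern: the four spaces act as the delimiter and are removed.
    static final String CONSUMING = "(?m)^\\s{4}";
    // Look-ahead pattern: zero-width match, so the spaces stay in each paragraph.
    static final String LOOKAHEAD = "(?m)(?=^\\s{4})";

    public static void main(String[] args) {
        // Assumed sample input with \n line endings.
        String text = "    Hello, World!\n    Hello, World!";

        // The match at index 0 has positive width, so an empty leading element is kept.
        System.out.println(Arrays.toString(text.split(CONSUMING)));

        // The match at index 0 is zero-width, so Java 8+ adds no empty leading element.
        System.out.println(Arrays.toString(text.split(LOOKAHEAD)));
    }
}
```

On Java 8 or later the consuming pattern yields three elements, the first of them empty, while the look-ahead pattern yields two elements that still contain their leading spaces and line break.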

To get rid of unwanted separators such as spaces or line breaks at the start or end of each resulting paragraph, you can simply use the trim method:

public static void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (String paragraph : paragraphs) {
        System.out.println("Paragraph: " + paragraph.trim());
    }
}

Example:

 String s = 
        "    Hello, World!\r\n" + 
        "    Hello, World!\r\n" + 
        "    Hello, World!";
 parseText(s);

Output:

Paragraph: Hello, World!
Paragraph: Hello, World!
Paragraph: Hello, World!


Pre-Java 8 version:

If you need to use this code on older versions of Java, you will need to prevent splitting at the start of the string (to avoid getting an empty first element). To do this you can put (?!^) before the multiline flag. Because it comes before (?m), that ^ still matches only the start of the string, not the start of a line. Or, to be more explicit, you can use \A, which matches the start of the string regardless of the multiline flag.

So the pre-Java 8 version of the regex can look like

private static final String PARAGRAPH_SPLIT_REGEX = "(?!^)(?m)(?=^\\s{4})";

or

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?!\\A)(?=^\\s{4})";
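As a rough check that the two variants behave the same, this sketch splits one sample string with both. The class name PreJava8SplitDemo and the constant names are invented for illustration; the run described here is on a current Java version, where the explicit start-of-string guard is redundant but harmless.

```java
import java.util.Arrays;

class PreJava8SplitDemo {
    // (?!^) appears before (?m), so this ^ still means "start of the string".
    static final String WITH_NEGATED_CARET = "(?!^)(?m)(?=^\\s{4})";
    // \A always means "start of the string", regardless of the multiline flag.
    static final String WITH_NEGATED_A = "(?m)(?!\\A)(?=^\\s{4})";

    public static void main(String[] args) {
        // Assumed sample input with \n line endings.
        String text = "    Hello, World!\n    Hello, World!";

        String[] first = text.split(WITH_NEGATED_CARET);
        String[] second = text.split(WITH_NEGATED_A);

        // Neither pattern can match at index 0, so no empty first element appears.
        System.out.println(Arrays.toString(first));
        System.out.println(Arrays.equals(first, second));
    }
}
```

Both patterns split only before the second paragraph, so each call returns the same two elements.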
