Splitting a nested string keeping quotation marks

Problem description

I am working on a Java project that requires handling nested strings.

For an input string that looks like this in plain text:

This is "a string" and this is "a \"nested\" string"

The result must be the following:

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want the \" sequences to be kept.
I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

and I need to create an array of strings from the given s parameter according to the rules above, without using the Java Collections Framework or its derivatives.

I am unsure how to solve this problem.
Can a regular expression be written that solves it?

Update based on questions from the comments:

  • each unescaped " has its closing unescaped " (they are balanced)
  • each escaping character \ must itself be escaped if we want a literal backslash (to create text representing \ we need to write it as \\).

Recommended answer

You can use the following regex:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

See the regex demo.

Java demo:

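import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain text: This is "a string" and this is "a \"nested\" string"
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // print each token on its own line
}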

Explanation:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by zero or more characters other than " and \ ([^"\\]*), followed by zero or more repetitions of an escape sequence (\\.) plus zero or more characters other than " and \, and finally the closing double quote
  • | - or...
  • \S+ - one or more non-whitespace characters
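To make the role of the \\. part concrete, here is a small sketch that applies the same pattern to a few edge cases and returns only the first token. The class name EscapeDemo and the helper firstToken are hypothetical names chosen for this illustration; they are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows how the pattern treats escaped characters.
class EscapeDemo {
    // Same regex as in the answer, written as a Java string literal.
    static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    // Returns the first token found in the input, or null if there is none.
    static String firstToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Plain text: "a \" b" c   (the escaped quote does not end the token)
        System.out.println(firstToken("\"a \\\" b\" c"));
        // Plain text: "a\\" b      (an escaped backslash, then the real closing quote)
        System.out.println(firstToken("\"a\\\\\" b"));
        // Plain text: "" x         (an empty quoted token is still a token)
        System.out.println(firstToken("\"\" x"));
    }
}
```

Because \\. always consumes the backslash together with the character after it, an escaped backslash (\\) cannot accidentally escape the closing quote that follows it.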

Note

@Pshemo's suggestion, "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+"), is essentially the same expression but much less efficient, because it uses an alternation group quantified with *. With that construct the regex engine has to try two alternatives at every position, which involves much more backtracking. My unroll-the-loop version matches whole chunks of text at once, so it is faster and more reliable.
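As a rough check that the two forms agree on the question's input, the sketch below runs both patterns and compares the tokens they produce. It only compares results and does not measure speed. The class name PatternComparison and the helper tokenize are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares the unrolled and alternation-based patterns.
class PatternComparison {
    // Unrolled-loop pattern from the answer.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation-based pattern (the corrected form of the suggestion).
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Joins all matches with '\n' so the two results can be compared as strings.
    static String tokenize(Pattern pattern, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String unrolled = tokenize(UNROLLED, input);
        String alternation = tokenize(ALTERNATION, input);
        System.out.println(unrolled);
        System.out.println("Same tokens: " + unrolled.equals(alternation));
    }
}
```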

Update

Since a String[] is required as output, you need two passes over the input: count the matches, create an array of that size, and then run the matcher again to fill it:

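import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);

// First pass: count the matches so the array can be sized exactly.
int cnt = 0;
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);

// Second pass: rewind the matcher and store each match.
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result)); // Arrays is used only for printing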

See another IDEONE demo.
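Putting the two passes behind the method signature from the question could look like the following minimal sketch. The class name QuoteSplitter is a hypothetical name chosen for this example, and the code assumes the input follows the rules stated in the question (balanced unescaped quotes, backslash as the escape character).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class around the method signature from the question.
class QuoteSplitter {
    // A quoted token that may contain escapes, or a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // First pass: count the tokens.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Second pass: rewind and copy each token into a plain array.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(input)) {
            System.out.println(token);
        }
    }
}
```

The sketch uses only a String[] and java.util.regex, so it stays within the question's constraint of avoiding the Collections Framework.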
