Splitting a nested string keeping quotation marks
Problem description
I am working on a project in Java that requires handling nested strings.
For an input string that, in plain text, looks like this:
This is "a string" and this is "a \"nested\" string"
The result must be the following:
[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"
Note that I want the \" sequences to be kept.
I have the following method:
public static String[] splitKeepingQuotationMarks(String s);
and I need to create an array of strings from the given s parameter according to the rules above, without using the Java Collections Framework or its derivatives.
I am unsure how to solve this problem.
Can a regular expression be written that solves it?
Update based on questions from the comments:
- each unescaped " has its closing unescaped " (they are balanced)
- each escaping character \ must also be escaped if we want to create a literal representing it (to create text representing \ we need to write it as \\).
Recommended answer
You can use the following regex:
"[^"\\]*(?:\\.[^"\\]*)*"|\S+
See the regex demo. In Java:
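import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
// Each backslash of the regex is doubled once more inside the Java string literal.
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // one token per line
}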
Explanation:
- "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by zero or more characters other than " and \ ([^"\\]*), followed by zero or more repetitions of an escape sequence (\\.) plus zero or more characters other than " and \, and finally the closing double quote
- | - or...
- \S+ - one or more non-whitespace characters
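To make the role of the \\. part concrete, here is a minimal sketch that applies the same pattern to two small inputs containing escapes. The class name EscapeDemo and the helper firstToken are made up for this illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class EscapeDemo {
    static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    // Returns the first token found in the input, or null if there is none.
    static String firstToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The regex as the engine sees it, without Java string escaping.
        System.out.println(TOKEN.pattern());
        // Plain text: "a\\" b  (an escaped backslash right before the closing quote)
        System.out.println(firstToken("\"a\\\\\" b"));
        // Plain text: "say \"hi\"" now  (escaped quotes inside the quoted part)
        System.out.println(firstToken("\"say \\\"hi\\\"\" now"));
    }
}
```

Because \\. always consumes a backslash together with the character after it, an escaped backslash followed by a quote is read as "escaped backslash, then closing quote" rather than as an escaped quote.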
Note
@Pshemo's suggestion - "\"(?:\\\\.|[^\"])*\"|\\S+" (or "\"(?:\\\\.|[^\"\\\\])*\"|\\S+" would be more correct) - is essentially the same expression, but much less efficient since it uses an alternation group quantified with *. This construct involves much more backtracking, as the regex engine has to test each position and there are 2 alternatives to try at each position. My unroll-the-loop based version matches chunks of text at once, and is thus much faster and more reliable.
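As a quick sanity check, the following sketch runs the unrolled pattern and the "more correct" alternation variant over the sample input and compares the tokens they produce. It only checks that both give the same result on this one string and does not measure speed. The class name PatternComparison and the helper tokens are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; the two regexes are the ones quoted above.
class PatternComparison {
    // Unrolled-loop pattern from the answer.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation-based variant with the backslash excluded from the character class.
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Joins all matches with '|' so two token streams can be compared as strings.
    static String tokens(Pattern p, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String a = tokens(UNROLLED, str);
        String b = tokens(ALTERNATION, str);
        System.out.println(a);
        System.out.println(b);
        System.out.println("same tokens: " + a.equals(b));
    }
}
```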
Update
Since a String[] is required as output and collections are not allowed, you need two passes over the input: first count the matches, then create the array and run the matcher again to fill it:
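import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);

// First pass: count the matches so the array can be sized.
int cnt = 0;
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);

// Second pass: rewind the matcher and copy each match into the array.
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));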
See another IDEONE demo.
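Putting the two passes behind the signature from the question might look like the sketch below. The method name splitKeepingQuotationMarks comes from the question; the enclosing class name QuoteSplitter and the constant TOKEN are assumptions made for this example. It relies only on java.util.regex and a plain array, so no collection classes are involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class around the method signature given in the question.
class QuoteSplitter {
    // Compiled once and reused; same regex as in the answer.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // Pass 1: count the tokens.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Pass 2: rewind and fill an array of exactly that size.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] parts = splitKeepingQuotationMarks(
                "This is \"a string\" and this is \"a \\\"nested\\\" string\"");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```

For the sample input this prints the seven entries listed in the question, with the \" sequences left intact. An empty or whitespace-only input yields an empty array.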