Match key-value pattern regex

Question

I am making a key-value parser where the input string takes the form key:"value",key2:"value". Keys can contain the characters a-z, A-Z and 0-9. Values can contain any character, but the colon (:), comma (,), double quote (") and backslash (\) must be prefixed with a backslash. Commas separate the key-value pairs but are not needed after the last pair.

So far I have ([a-zA-Z0-9]+):"(.*)", which matches most keys and values, but obviously it won't handle more than a single pair, or input where any of the 'control' characters go unescaped. (?<=\\)[:,"\\] seems to match all escaped characters, but it will not match any 'normal' characters.

Is there a way to check for comma separation and to match all escaped 'control' characters as well as normal ones? Is this something that would be better suited to implementation without regex or would this need multiple patterns in sequence?

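Some examples:

Input: joe:"bread",sam:"fish"
Output:
joe -> bread
sam -> fish

Input: joe:"Look over there\, it's a shark!",sam:"I like fish."
Output:
joe -> Look over there, it's a shark!
sam -> I like fish.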

Recommended answer

This assumes that \ followed by any character except a line terminator stands for the character immediately following it.

You can use the following regex to match all instances of key-value pairs:

"([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""

Add \\s* before and after : if you want to allow free spacing.

This is what the regex engine sees:

([a-zA-Z0-9]+):"((?:[^\\"]|\\.)*+)"
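
One way to double-check the escaping is to compile the Java string literal and print the pattern text back. This is only a minimal sketch; the class name EscapingDemo and the method engineView are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

final class EscapingDemo {
    // The regex written as a Java string literal, copied from the answer.
    static final String JAVA_LITERAL = "([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"";

    // Returns the pattern text as the regex engine receives it,
    // that is, after the Java compiler has processed the string escapes.
    static String engineView() {
        return Pattern.compile(JAVA_LITERAL).pattern();
    }

    public static void main(String[] args) {
        System.out.println(engineView());
    }
}
```

Each doubled backslash in the source collapses to a single one, which is why four backslashes are needed in the literal to give the engine one escaped backslash.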

The quantifier * is made possessive (*+), since the two branches [^\\"] and \\. are mutually exclusive (no string can be matched by both at the same time). It also avoids a StackOverflowError in Oracle's implementation of the Pattern class.

Use the regex above in a Matcher loop:

Pattern keyValuePattern = Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");
Matcher matcher = keyValuePattern.matcher(inputString);

while (matcher.find()) {
    String key = matcher.group(1);

    // Process the escape sequences in the value string
    String value = matcher.group(2).replaceAll("\\\\(.)", "$1");

    // ...
}
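
For reference, the loop can be wrapped into a self-contained program along these lines. This is a sketch rather than a definitive implementation: the class name KeyValueParser, the method parse, and the choice of a LinkedHashMap (so a repeated key keeps only its last value) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyValueParser {
    private static final Pattern KEY_VALUE =
        Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Collects every key-value pair found in the input, in order of appearance.
    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher matcher = KEY_VALUE.matcher(input);
        while (matcher.find()) {
            String key = matcher.group(1);
            // Drop the backslash of each escape sequence, keeping the escaped character.
            String value = matcher.group(2).replaceAll("\\\\(.)", "$1");
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "joe:\"Look over there\\, it's a shark!\",sam:\"I like fish.\"";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```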

In the general case, depending on the complexity of the escape sequences (e.g. \n, \uhhhh, \xhh, \0), you might want to write a separate function to parse them. However, under the assumption above, the one-liner suffices.
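
If richer escapes are needed, such a separate function might look roughly like the sketch below. The supported set (\n, \t, \0, \xhh, \uhhhh), the rule that any other escaped character stands for itself, and the names EscapeDecoder, unescape and appendHex are all assumptions for illustration, not part of the original answer.

```java
final class EscapeDecoder {
    // Decodes backslash escapes in a raw value. Supported here (an assumption):
    // newline, tab, NUL, backslash-x plus two hex digits, and backslash-u plus
    // four hex digits. Any other escaped character stands for itself.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            if (c != '\\' || i == raw.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                continue;
            }
            char escaped = raw.charAt(i++);
            switch (escaped) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case '0': out.append('\0'); break;
                case 'x': i = appendHex(raw, i, 2, out); break;
                case 'u': i = appendHex(raw, i, 4, out); break;
                default:  out.append(escaped);
            }
        }
        return out.toString();
    }

    // Reads exactly `digits` hex digits starting at `start`, appends the
    // resulting character, and returns the index just past them.
    private static int appendHex(String raw, int start, int digits, StringBuilder out) {
        int end = start + digits;
        if (end > raw.length()) {
            throw new IllegalArgumentException("Truncated hex escape at index " + start);
        }
        int code = 0;
        for (int k = start; k < end; k++) {
            int d = Character.digit(raw.charAt(k), 16);
            if (d < 0) {
                throw new IllegalArgumentException("Bad hex digit at index " + k);
            }
            code = code * 16 + d;
        }
        out.append((char) code);
        return end;
    }
}
```

In the Matcher loop, the call matcher.group(2).replaceAll(...) would then be replaced by EscapeDecoder.unescape(matcher.group(2)).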

Note, though, that this solution doesn't care about the separators, and it will skip over invalid input to the nearest match. In the example of invalid input below, the solution above matches abc:"xyz:" (key abc, value xyz:, since an unescaped colon is accepted inside a value), skips the stray text text", and happily matches more:"pair" as a second key-value pair:

abc:"xyz:"text text", more:"pair"

If this behavior is not desirable, there is a solution, but the string containing all the key-value pairs must be isolated first, instead of being part of a bigger string that doesn't have anything to do with key-value pairs:

"(?:^|(?!^)\\G,)([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""

Free-spacing version:

"(?:^\\s*|(?!^)\\G\\s*,\\s*)([a-zA-Z0-9]+)\\s*:\\s*\"((?:[^\\\\\"]|\\\\.)*+)\""

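Because each match of the anchored pattern must continue exactly where the previous one ended, comparing the end of the last match with the length of the input shows whether the whole string was valid. The sketch below uses the version without free spacing; the class name StrictKeyValueParser, the thrown exception, and the decision to reject a trailing comma are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictKeyValueParser {
    // Anchored pattern from the answer: a match must start at the beginning of
    // the input, or directly after the previous match followed by a comma.
    private static final Pattern ANCHORED = Pattern.compile(
        "(?:^|(?!^)\\G,)([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Parses the whole input, or throws if some part of it is not a valid pair.
    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher matcher = ANCHORED.matcher(input);
        int consumed = 0; // index just past the last successful match
        while (matcher.find()) {
            pairs.put(matcher.group(1), matcher.group(2).replaceAll("\\\\(.)", "$1"));
            consumed = matcher.end();
        }
        if (consumed != input.length()) {
            throw new IllegalArgumentException("Invalid input at index " + consumed);
        }
        return pairs;
    }
}
```

With this check, the malformed example above is rejected instead of being partially parsed, while joe:"bread",sam:"fish" yields both pairs.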