Extracting key-value pairs from a string with quotes

Question

I am having trouble coding an 'elegant' parser for this requirement (one that does not look like a piece of C breakfast). The input is a string of key-value pairs, separated by ',' and joined by '='.

key1=value1,key2=value2

The part tricking me is that values can be quoted ("), and inside the quotes a ',' does not end the value.

key1=value1,key2="value2,still_value2"

This last part has made it tricky for me to use split or re.split, and I ended up resorting to 'for i in range' loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes appear only in values, and that there are no whitespace or other non-alphanumeric characters.

Recommended answer

I would advise against using regular expressions for this task, because the language you want to parse is not regular.

You have a character string containing multiple key-value pairs. The best way to parse it is not to match patterns on it, but to properly tokenize it.

There is a module in the Python standard library, called shlex, that mimics the parsing done by POSIX shells and provides a lexer implementation that can easily be customized to your needs.
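To make the idea of tokenizing concrete, here is a minimal sketch of what a customized shlex lexer yields for the input from the question. The variable names are only illustrative; the two attributes changed (whitespace and wordchars) are standard shlex attributes.

```python
from shlex import shlex

# Input taken from the question: the second value contains a quoted comma.
text = 'key1=value1,key2="value2,still_value2"'

lexer = shlex(text, posix=True)
lexer.whitespace = ","   # treat ',' as the only token separator
lexer.wordchars += "="   # keep 'key=value' together as a single token

tokens = list(lexer)
print(tokens)  # ['key1=value1', 'key2=value2,still_value2']
```

Each token is one complete key=value pair with the quotes already removed, so the only step left is splitting each token on the first '='.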

from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Parse key-value pairs from a shell-like text."""
    # initialize a lexer, in POSIX mode (to properly handle escaping)
    lexer = shlex(text, posix=True)
    # set ',' as whitespace for the lexer
    # (the lexer will use this character to separate words)
    lexer.whitespace = item_sep
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs)
    # (if your option key or value contains any unquoted special character, you will need to add it here)
    lexer.wordchars += value_sep
    # then we separate option keys and values to build the resulting dictionary
    # (maxsplit is required to make sure that '=' in value will not be a problem)
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

Example run:

parse_kv_pairs(
  'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\''
)

Output:

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

I forgot to add that the reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you need to allow more possible inputs later on. I never found a way to properly parse such key-value pairs with regular expressions; there will always be inputs (e.g. A="B=\"1,2,3\"") that trick the engine.
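As a quick check of that claim, the sketch below feeds the tricky input to a compact copy of the shlex-based function from this answer. The function is repeated only so the snippet runs on its own; the raw string literal is an assumption about how the input would be written in Python source.

```python
from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Compact copy of the shlex-based parser, repeated to keep this runnable."""
    lexer = shlex(text, posix=True)
    lexer.whitespace = item_sep
    lexer.wordchars += value_sep
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

# Raw string so the backslashes reach the lexer unchanged.
tricky = r'A="B=\"1,2,3\"",C=D'
result = parse_kv_pairs(tricky)
print(result)  # {'A': 'B="1,2,3"', 'C': 'D'}
```

In POSIX mode the lexer treats \" inside double quotes as an escaped quote, so the inner quotes, the commas, and the second '=' all stay inside the value of A.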

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.
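For the restricted input described in the question (word-character keys, no escaped quotes), a regular expression along the following lines might be enough. This is only a sketch under those assumptions; the pattern and the name parse_kv_pairs_re are illustrative and not part of the original answer.

```python
import re

# key, '=', then either a double-quoted value or a bare value
# that contains no comma and no quote.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|([^,"]*))')

def parse_kv_pairs_re(text):
    """Regex-based sketch; assumes no escaped quotes inside values."""
    result = {}
    for match in PAIR_RE.finditer(text):
        key, quoted, bare = match.groups()
        # Exactly one of the two value groups takes part in a match.
        result[key] = quoted if quoted is not None else bare
    return result

simple = parse_kv_pairs_re('key1=value1,key2="value2,still_value2"')
print(simple)  # {'key1': 'value1', 'key2': 'value2,still_value2'}
```

This pattern does not understand backslash escapes, so an input such as A="B=\"1,2,3\"" is cut off at the first escaped quote. That is exactly the kind of surprise the shlex approach avoids.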

split has a maxsplit argument that is much cleaner to use than splitting/slicing/joining. Thanks to @cdlane for his sound input!
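A small comparison of the two approaches, using a made-up token whose value itself contains '=':

```python
word = "key=a=b"  # hypothetical token; the value is 'a=b'

# With maxsplit: split only on the first '='.
key, value = word.split("=", maxsplit=1)

# Without maxsplit: split everywhere, then slice and re-join the value.
parts = word.split("=")
key_joined, value_joined = parts[0], "=".join(parts[1:])

print(key, value)                # key a=b
print(key_joined, value_joined)  # key a=b
```

Both give the same result, but the maxsplit version states the intent directly and avoids rebuilding the value.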
