Converting key=value pairs back into Python dicts

Question

There's a logfile with text in the form of space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:

' '.join([f'{k}={v!r}' for k,v in d.items()])

The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more no less.
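
To make the log format concrete, here is a minimal sketch of the writing side. The helper name `serialize` and the sample record are assumptions for illustration; only the join expression comes from the question.

```python
import ast

def serialize(d):
    # Same expression as in the question: key=repr(value), joined by single spaces.
    return ' '.join(f'{k}={v!r}' for k, v in d.items())

record = {'s': '1234', 'n': 1234, 'key': 'hello world'}  # hypothetical sample data
line = serialize(record)
print(line)  # s='1234' n=1234 key='hello world'

# Taken one at a time, each value's repr round-trips through ast.literal_eval.
assert all(ast.literal_eval(repr(v)) == v for v in record.values())
```

The difficulty is therefore not evaluating a single value but finding where each value starts and ends inside the joined line.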

How can I process this logfile and turn the lines back into Python dicts? Examples:

>>> to_dict("key='hello world'")
{'key': 'hello world'}

>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}

>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}

>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}

Here is some extra context about the data:

  • Keys are valid names
  • Input lines are well-formed (e.g. no dangling brackets)
  • The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
  • Order is not important. Performance is not important. Correctness is important.

As requested in the comments, here is an MCVE with example code that did not work correctly:

>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
... 
... 
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}  # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}  # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'}  # Incorrect, the value was corrupted
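
The failure above comes from treating every space as a separator. A short sketch of why plain string operations cannot tell separator spaces from spaces inside values (the sample line is an assumption built from the question's examples):

```python
line = "key='hello world' k5={'k6': ['potato']}"

# str.replace rewrites every space, including those inside string
# literals and inside nested containers.
print(line.replace(' ', ', '))
# key='hello, world', k5={'k6':, ['potato']}

# Splitting on whitespace has the same problem: 2 pairs become 4 fragments.
fragments = line.split()
print(fragments)
```

Any approach that works on raw characters needs to track quoting and bracket nesting itself, which is the job a tokenizer already does.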

Recommended answer

Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.

The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.
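
As a rough illustration of that idea, the sketch below tokenizes one line and locates the `=` operator tokens. The sample line and the variable names are assumptions; the `tokenize` calls are the standard library ones.

```python
import io
import tokenize

line = "a='=' n=1234"  # hypothetical sample; the first value is itself an '=' string

# tokenize.tokenize expects the readline method of a binary file-like object.
readline = io.BytesIO(line.encode('utf8')).readline
tokens = list(tokenize.tokenize(readline))

# Show the kind and text of every token.
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Only operator tokens spelled '=' count as separators. The '=' inside the
# string literal belongs to a single STRING token and is not matched.
eq_positions = [i for i, tok in enumerate(tokens)
                if tok.type == tokenize.OP and tok.string == '=']
keys = [tokens[i - 1].string for i in eq_positions]
print(keys)  # ['a', 'n']
```

The token immediately before each `=` is the key, and everything between one `=` and the key of the next pair is the value.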

import ast
import io
import tokenize

def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of a binary
    # file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))

    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']

    names = [tokens[i-1][1] for i in eqsign_locations]

    # Values are harder than keys.
    val_starts = [i+1 for i in eqsign_locations]
    val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]

    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]

    return dict(zip(names, vals))

This behaves correctly on your example inputs, as well as on an example with backslashes:

>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
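
Since correctness is the stated priority, a round-trip check is one way to gain confidence. The sketch below is an assumption-laden illustration: `serialize` and `parse_line` are hypothetical names, `parse_line` is a condensed restatement of the tokenize approach above rather than the answer's exact code, and the sample records are made up.

```python
import ast
import io
import tokenize

def serialize(d):
    # The serialization expression from the question.
    return ' '.join(f'{k}={v!r}' for k, v in d.items())

def parse_line(line):
    # Condensed restatement of the tokenize-based approach (hypothetical name).
    tokens = list(tokenize.tokenize(io.BytesIO(line.encode('utf8')).readline))
    eq = [i for i, tok in enumerate(tokens)
          if tok.type == tokenize.OP and tok.string == '=']
    # Each value ends just before the key of the next pair, or at the end.
    ends = [i - 1 for i in eq[1:]] + [len(tokens)]
    result = {}
    for start, end in zip(eq, ends):
        # Drop position info so untokenize does not try to recreate columns.
        text = tokenize.untokenize(tok[:2] for tok in tokens[start + 1:end]).strip()
        result[tokens[start - 1].string] = ast.literal_eval(text)
    return result

records = [
    {'key': 'hello world'},
    {'s': '1234', 'n': 1234},
    {'a': '=', 'b': '"\'', 'c': 3},
    {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}},
    {'t': (1, -2.5), 'flag': True, 'nothing': None},
]

# Every record should survive serialize -> parse unchanged.
for record in records:
    assert parse_line(serialize(record)) == record
```

A check like this only covers the value types it is fed, so it is worth extending the sample records with whatever actually appears in the logfile.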

Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.
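
To see why NAME tokens are the more fragile marker, the sketch below lists token kinds for a sample line. Two points here go beyond the answer's text: `tokenize` also reports `True`, `False` and `None` as NAME tokens, and the `ast` documentation notes that `literal_eval` accepts `set()` as of Python 3.9. The helper name and sample line are assumptions.

```python
import io
import tokenize

def token_kinds(line):
    # Return (token kind, token text) pairs for one line (hypothetical helper).
    readline = io.BytesIO(line.encode('utf8')).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]

kinds = token_kinds("flag=True empty=set()")

names = [text for kind, text in kinds if kind == 'NAME']
equals = [text for kind, text in kinds if kind == 'OP' and text == '=']

print(names)        # ['flag', 'True', 'empty', 'set']
print(len(equals))  # 2
```

Two pairs produce four NAME tokens but exactly two `=` tokens, so counting `=` still identifies the keys correctly for this input.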
