Converting key=value pairs back into Python dicts
Problem description
There's a logfile with text in the form of space-separated key=value pairs. Each line was originally serialized from data in a Python dict, with something like:
' '.join([f'{k}={v!r}' for k,v in d.items()])
The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more, no less.
How can I process this logfile and convert the lines back into Python dicts? Examples:
>>> to_dict("key='hello world'")
{'key': 'hello world'}
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Here is some extra context about the data:
- Keys are valid names
- Input lines are well-formed (e.g. no dangling brackets)
- The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
- Order is not important. Performance is not important. Correctness is important.
As requested in the comments, here is an MCVE and some example code that didn't work correctly:
>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
...
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'} # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234} # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'} # Incorrect, the value was corrupted
Recommended answer
Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.
The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.
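To make the tokenizing step concrete, here is a minimal sketch of what the token stream looks like for one line. The sample line is made up for illustration, and the sketch uses tokenize.generate_tokens (the str-based entry point) rather than tokenize.tokenize, so no encoding step is needed.

```python
import io
import tokenize

# Made-up sample line: note the '=' inside the first string literal.
line = "a='=' b={'k': [1, 2]} c=3"

# generate_tokens takes a readline callable that returns str.
tokens = list(tokenize.generate_tokens(io.StringIO(line).readline))

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Only the key-value separators are OP tokens whose string is '='.
# The '=' inside the string literal stays within a single STRING token.
separators = [i for i, tok in enumerate(tokens)
              if tok.type == tokenize.OP and tok.string == '=']

# Each key is the NAME token immediately before a separator.
keys = [tokens[i - 1].string for i in separators]
print(keys)
```

The quoted '=' never shows up as a separator, which is why splitting on = tokens is safer than splitting the raw text on the = character.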
import ast
import io
import tokenize

def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of a binary
    # file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))

    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']

    names = [tokens[i-1][1] for i in eqsign_locations]

    # Values are harder than keys.
    val_starts = [i+1 for i in eqsign_locations]
    val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]

    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]

    return dict(zip(names, vals))
This behaves correctly on your example inputs, as well as on an example with backslashes:
>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
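Since correctness is what matters here, a round-trip check is a cheap way to gain some confidence: serialize a dict the way the question describes, parse the line back, and compare. The sketch below is self-contained, so it carries its own compact parser; parse_line and serialize are hypothetical names, and parse_line is a condensed variant of the same = token idea built on tokenize.generate_tokens, not the exact function above. The sample dict is made up.

```python
import ast
import io
import tokenize

def serialize(d):
    # Same format as the one-liner in the question.
    return ' '.join(f'{k}={v!r}' for k, v in d.items())

def parse_line(line):
    # Hypothetical condensed variant of the '=' token approach.
    toks = list(tokenize.generate_tokens(io.StringIO(line).readline))
    eq = [i for i, t in enumerate(toks)
          if t.type == tokenize.OP and t.string == '=']
    # A value ends just before the NAME token that precedes the next '=';
    # the last value runs to the end of the token list.
    ends = [i - 1 for i in eq[1:]] + [len(toks)]
    result = {}
    for start, end in zip(eq, ends):
        # Drop position info so untokenize does not try to rebuild columns,
        # then strip the whitespace it may leave around the value.
        text = tokenize.untokenize(t[:2] for t in toks[start + 1:end]).strip()
        result[toks[start - 1].string] = ast.literal_eval(text)
    return result

sample = {
    'key': 'hello world',
    's': '1234',
    'n': 1234,
    'k5': {'k6': ['potato']},
    'neg': -1.5,
    'flag': True,
    'nothing': None,
    'eq': 'a=b c=d',
    'quote': '"\'',
}

round_tripped = parse_line(serialize(sample))
print(round_tripped == sample)
```

A check like this only covers the values you feed it, so it is worth extending the sample dict with whatever value types actually appear in the logfile.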
Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.
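One thing worth checking before relying on NAME tokens: the tokenize module reports keywords and constants such as True, False and None as NAME tokens too, so NAME tokens can already appear inside values that ast.literal_eval accepts. The ast documentation also notes that Python 3.9 and later accept set() for an empty set. A small sketch with a made-up line shows the difference between the two strategies:

```python
import io
import tokenize

# Made-up sample line whose values include True and None.
line = "flag=True nothing=None count=3"
tokens = list(tokenize.generate_tokens(io.StringIO(line).readline))

# NAME tokens cover the keys *and* the True/None values.
names = [t.string for t in tokens if t.type == tokenize.NAME]

# '=' OP tokens only occur once per key-value pair.
eq_count = sum(1 for t in tokens
               if t.type == tokenize.OP and t.string == '=')

print(names)
print(eq_count)
```

Here there are five NAME tokens but only three pairs, so counting NAME tokens alone would misidentify True and None as keys, while the = tokens still line up one-to-one with the pairs.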