Map a function by key path in nested dict including slices, wildcards and ragged hierarchies
Question
This question is based on the one linked here.
What is a good approach to mapping a function at a specified key path in nested dicts, supporting these path specifications:
- A list of keys at a given path position
- Key slices (assuming the keys are sorted)
- Wildcards (i.e. all keys at a path position)
- Ragged hierarchies, handled by ignoring keys that do not appear at a given level
If it makes things simpler, you can assume that only dicts are nested, with no lists of dicts, since a list can be turned into a dict with dict(enumerate(...)).
However, the hierarchy can be ragged, e.g.:
data = {0: {'a': 1, 'b': 2},
        1: {'a': 10, 'c': 13},
        2: {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23},
        3: {'a': 30, 'b': 31, 'c': {'d': 300}}}
I would like to be able to specify a key path like this:
map_at(f, ['*',['b','c'],'d'])
which should return:
{0: {'a': 1, 'b': 2},
 1: {'a': 10, 'c': 13},
 2: {'a': 20, 'b': {'d': f(100), 'e': 101}, 'c': 23},
 3: {'a': 30, 'b': 31, 'c': {'d': f(300)}}}
Here f is mapped to key paths [2,b,d] and [3,c,d].
Slices would be specified as, for example, [0:3,b].
I think the path spec is unambiguous, though it could be generalized to, for example, match key path prefixes (in which case f would also be mapped at [0,b] and other paths).
Can this be implemented via comprehension and recursion, or does it require heavy lifting to catch KeyError etc.?
Please do not suggest Pandas as an alternative.
Recommended answer
I'm not a big fan of pseudo-code, but in this kind of situation you need to write down an algorithm. Here's my understanding of your requirements:
map_at(func, path_pattern, data):

1. if path_pattern is not empty:
   - if data is terminal, it's a failure: we did not match the full path_pattern, so there is no reason to apply the function. Just return data.
   - else, we have to explore every path in data. We consume the head of path_pattern if possible. That is, return a dict data key -> map_at(func, new_path, data value), where new_path is the tail of path_pattern if the key matches the head, else path_pattern itself.
2. else (path_pattern is empty, so it was fully consumed):
   - if data is terminal, return func(data).
   - else, find the leaves and apply func: return a dict data key -> map_at(func, [], data value).
Notes:
- I assume that the pattern *-b-d matches the path 0-a-b-c-d-e;
- it's an eager algorithm: the head of the path is always consumed when possible;
- if the path is fully consumed, every terminal should be mapped;
- it's a simple DFS, thus I guess it's possible to write an iterative version with a stack.
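The last note can be sketched as follows. This is only one possible iterative formulation, not a tested replacement for the recursive code: the names map_at_iter and holder are made up for the illustration, and it checks isinstance(node, dict) instead of catching AttributeError.

```python
def map_at_iter(func, path_pattern, data):
    """Iterative sketch of the algorithm above, using an explicit stack (DFS)."""

    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError:  # e.g. an int key tested against a str pattern
            return False

    holder = {}  # hypothetical wrapper so the root is handled like any child
    # Each entry: (output container, key to fill, input node, remaining pattern)
    stack = [(holder, 'root', data, list(path_pattern))]
    while stack:
        out, key, node, pattern = stack.pop()
        if isinstance(node, dict):
            child = out[key] = {}
            for k, v in node.items():
                child[k] = None  # reserve the slot so key order is preserved
                eat = bool(pattern) and matches(pattern[0], k)
                stack.append((child, k, v, pattern[1:] if eat else pattern))
        else:
            # terminal: map it only if the whole pattern was consumed
            out[key] = node if pattern else func(node)
    return holder['root']


# The first note, checked on a single path 0-a-b-c-d-e:
nested = {0: {'a': {'b': {'c': {'d': {'e': 1}}}}}}
print(map_at_iter(str, ['*', 'b', 'd'], nested))
```

Keys that do not match the head ('a' and 'c' here) are simply skipped without consuming anything, which is why *-b-d still reaches the leaf under e.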
Here's the code:
def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list.
            return False

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v) for k,v in data.items()}
        except AttributeError: # fail: terminal data but path_pattern was not consumed
            return data
    else: # success: path_pattern is empty.
        try: # not a leaf: map every leaf of every path
            return {k: map_at(func, [], v) for k,v in data.items()}
        except AttributeError: # a leaf: map it
            return func(data)
Note that tail if matches(head, k) else path_pattern means: consume head if possible. To use a range in the pattern, just use range(...).
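To make this concrete, here is a small run against the data from the question, once with the key list from the question and once with range(0, 3) standing in for the slice [0:3,b]. The function negate is a hypothetical stand-in for f; map_at is the function above, repeated so the snippet runs on its own.

```python
def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError:
            return False

    if path_pattern:
        head, *tail = path_pattern
        try:  # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v)
                    for k, v in data.items()}
        except AttributeError:  # terminal data, pattern not consumed
            return data
    else:
        try:  # not a leaf: map every leaf below
            return {k: map_at(func, [], v) for k, v in data.items()}
        except AttributeError:  # a leaf: map it
            return func(data)


data = {0: {'a': 1, 'b': 2},
        1: {'a': 10, 'c': 13},
        2: {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23},
        3: {'a': 30, 'b': 31, 'c': {'d': 300}}}


def negate(x):  # hypothetical stand-in for f
    return -x


by_list = map_at(negate, ['*', ['b', 'c'], 'd'], data)
by_range = map_at(negate, [range(0, 3), 'b'], data)
print(by_list)
print(by_range)
```

With the range pattern, keys 0, 1 and 2 consume the head and key 3 does not, so nothing under 3 changes. Because the pattern is fully consumed at [2,b], both leaves below it (d and e) are mapped, which follows from the "map every terminal" rule above.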
As you can see, you never escape from case 2: if path_pattern is empty, you just have to map all leaves, whatever happens. This is clearer in this version:
def map_all_leaves(func, data):
    """Apply func to all leaves"""
    try:
        return {k: map_all_leaves(func, v) for k,v in data.items()}
    except AttributeError:
        return func(data)

def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list.
            return False

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v) for k,v in data.items()}
        except AttributeError: # fail: terminal data but path_pattern is not consumed
            return data
    else: # success: path_pattern is empty, map every leaf below
        return map_all_leaves(func, data)
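One thing to keep in mind with matches as written: when the pattern at a position is a plain string, value in pattern is a substring test, so a pattern 'ab' also matches the key 'a'. If that is not wanted, a stricter helper along these lines could be swapped in. The name matches_strict is made up here, and this is only a sketch of one option.

```python
def matches(pattern, value):
    """The helper from the answer, unchanged."""
    try:
        return pattern == '*' or value == pattern or value in pattern
    except TypeError:
        return False


def matches_strict(pattern, value):
    """Hypothetical variant: a str pattern matches only '*' or an equal key."""
    if pattern == '*' or value == pattern:
        return True
    if isinstance(pattern, str):
        return False  # no substring matching for plain string patterns
    try:
        return value in pattern  # list, tuple, set, range, ...
    except TypeError:  # pattern is not a container (e.g. an int)
        return False


print(matches('ab', 'a'))         # substring test succeeds
print(matches_strict('ab', 'a'))  # equality only
```

Lists of keys, range(...) and the '*' wildcard behave the same in both helpers.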
EDIT
If you want to handle lists, you can try this:
def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list.
            return False

    def get_items(data):
        # str and bytes are iterable but are leaves here: enumerating a str
        # would recurse forever on its single characters.
        if isinstance(data, (str, bytes)):
            raise TypeError("terminal data")
        try:
            return data.items()
        except AttributeError:
            return enumerate(data) # raises TypeError if data is not iterable

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v) for k,v in get_items(data)}
        except TypeError: # fail: terminal data but path_pattern was not consumed
            return data
    else: # success: path_pattern is empty.
        try: # not a leaf: map every leaf of every path
            return {k: map_at(func, [], v) for k,v in get_items(data)}
        except TypeError: # a leaf: map it
            return func(data)
The idea is simple: enumerate is the list equivalent of dict.items:
>>> list(enumerate(['a', 'b']))
[(0, 'a'), (1, 'b')]
>>> list({0:'a', 1:'b'}.items())
[(0, 'a'), (1, 'b')]
Hence, get_items is just a wrapper that returns the dict items, the list items (index, value), or raises an error.
The flaw is that lists are converted to dicts in the process:
>>> data2 = [{'a': 1, 'b': 2}, {'a': 10, 'c': 13}, {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23}, {'a': 30, 'b': 31, 'c': {'d': 300}}]
>>> map_at(type,['*',['b','c'],'d'],data2)
{0: {'a': 1, 'b': 2}, 1: {'a': 10, 'c': 13}, 2: {'a': 20, 'b': {'d': <class 'int'>, 'e': 101}, 'c': 23}, 3: {'a': 30, 'b': 31, 'c': {'d': <class 'int'>}}}
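If keeping the lists matters, one way around this flaw is to rebuild each container in its original type after mapping. The following is a rough sketch rather than part of the answer's code: map_at_keep_lists and next_pattern are names chosen for the example, and it uses isinstance checks instead of catching exceptions, so strings count as leaves.

```python
def map_at_keep_lists(func, path_pattern, data):
    """Variant of map_at that returns lists as lists instead of dicts."""

    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError:
            return False

    if isinstance(data, dict):
        items = list(data.items())
    elif isinstance(data, list):
        items = list(enumerate(data))  # list indices play the role of keys
    else:
        # terminal: map it only if the whole pattern was consumed
        return data if path_pattern else func(data)

    def next_pattern(key):
        # consume the head when the key matches, else keep the pattern as is
        if path_pattern and matches(path_pattern[0], key):
            return path_pattern[1:]
        return path_pattern

    mapped = [(k, map_at_keep_lists(func, next_pattern(k), v)) for k, v in items]
    if isinstance(data, list):
        return [v for _, v in mapped]
    return dict(mapped)


data2 = [{'a': 1, 'b': 2}, {'a': 10, 'c': 13},
         {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23},
         {'a': 30, 'b': 31, 'c': {'d': 300}}]
result = map_at_keep_lists(type, ['*', ['b', 'c'], 'd'], data2)
print(result)
```

The same leaves are mapped as in the output above, but the outer container stays a list.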
EDIT
Since you are looking for something like XPath for JSON, you could try https://pypi.org/project/jsonpath/ or https://pypi.org/project/jsonpath-rw/. (I did not test those libs.)