What is the best way to implement nested dictionaries?


Problem description

I have a data structure which essentially amounts to a nested dictionary. Let's say it looks like this:

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Now, maintaining and creating this is pretty painful; every time I have a new state/county/profession I have to create the lower layer dictionaries via obnoxious try/except blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like such:

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).
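To make the trade-off concrete, here is a minimal sketch of the kind of code the flat, tuple-keyed layout requires for aggregation and subsetting. The helper names totals_by_state and subset_for_state are hypothetical and chosen only for illustration; the data is the sample from above.

```python
from collections import defaultdict

# Sample data from the question: (state, county, occupation) -> count.
flat = {
    ('new jersey', 'mercer county', 'plumbers'): 3,
    ('new jersey', 'mercer county', 'programmers'): 81,
    ('new jersey', 'middlesex county', 'programmers'): 81,
    ('new jersey', 'middlesex county', 'salesmen'): 62,
    ('new york', 'queens county', 'plumbers'): 9,
    ('new york', 'queens county', 'salesmen'): 36,
}


def totals_by_state(flat_data):
    """Sum the counts per state (hypothetical helper, not from the question)."""
    totals = defaultdict(int)
    for (state, _county, _occupation), number in flat_data.items():
        totals[state] += number
    return dict(totals)


def subset_for_state(flat_data, wanted_state):
    """Keep only the entries whose key starts with the given state."""
    return {key: number for key, number in flat_data.items()
            if key[0] == wanted_state}


state_totals = totals_by_state(flat)
new_york_only = subset_for_state(flat, 'new york')
```

Every state-level operation has to scan all keys and unpack the tuple, which is the syntactic cost described above.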

Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.
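One way to get both views without a wrapper class is to keep the nested form and derive the flat form on demand. The sketch below is only an illustration under that assumption; flatten is a hypothetical helper, and it skips empty sub-dictionaries because they have no leaf value to report.

```python
def flatten(nested, prefix=()):
    """Yield (key_tuple, value) pairs for every leaf of a nested dict."""
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Descend into the sub-dictionary, extending the key path.
            yield from flatten(value, path)
        else:
            yield path, value


nested = {
    'new jersey': {
        'mercer county': {'plumbers': 3, 'programmers': 81},
        'middlesex county': {'programmers': 81, 'salesmen': 62},
    },
    'new york': {
        'queens county': {'plumbers': 9, 'salesmen': 36},
    },
}

# Flat, tuple-keyed view of the same data.
flat_view = dict(flatten(nested))
```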

How could I do this better?

Addendum: I'm aware of setdefault() but it doesn't really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.

Solution

What is the best way to implement nested dictionaries in Python?

This is a bad idea, don't do it. Instead, use a regular dictionary and use dict.setdefault where appropriate, so that when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here's how to shoot yourself in the foot:

Implement __missing__ on a dict subclass to set and return a new instance.

This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
        return value                     # faster to return than dict lookup

(Note self[key] is on the left-hand side of assignment, so there's no recursion here.)
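The note about recursion can be checked with a small experiment. The following sketch uses a hypothetical subclass, CountingVividict, which differs from Vividict only in that it counts how often __missing__ runs; the counter is an addition for illustration.

```python
class CountingVividict(dict):
    """Vividict variant that counts __missing__ calls (illustration only)."""
    missing_calls = 0

    def __missing__(self, key):
        type(self).missing_calls += 1
        # The assignment goes through __setitem__, not __getitem__,
        # so __missing__ is not re-entered here.
        value = self[key] = type(self)()
        return value


d = CountingVividict()
d['a']['b']   # two missing keys, so __missing__ runs twice
d['a']['b']   # both keys exist now, so __missing__ does not run again
calls = CountingVividict.missing_calls
```

Each missing key triggers exactly one call, and later lookups of the same key go straight to the stored value.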

Say you have some data:

data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}

Here's our usage code:

vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number

And now:

>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Criticism

A criticism of this type of container is that if the user misspells a key, our code could fail silently:

>>> vividict['new york']['queens counyt']
{}

Additionally, we'd now have a misspelled county in our data:

>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}
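If you use such a container anyway, membership tests and dict.get are one way to probe for a key without creating it, since only subscript access goes through __missing__. A minimal sketch, with the variable names chosen for illustration:

```python
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value


v = Vividict()
v['new york']['queens county']['plumbers'] = 9

# Neither `in` nor .get() calls __missing__, so the typo is not stored.
has_typo = 'queens counyt' in v['new york']
typo_value = v['new york'].get('queens counyt')
keys_after_checks = sorted(v['new york'])
```

This does not remove the underlying risk: any plain subscript lookup with a misspelled key still inserts it.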

Explanation:

We're just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Keeping the new value in a local name and returning that is useful because it avoids an additional lookup on the dict; unfortunately, assignment is a statement in Python, so we can't return the value in the same expression that sets it.)

Note that these are the same semantics as the most upvoted answer, nosklo's implementation, but in half the lines of code:

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Demonstration of Usage

Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.

import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

Which outputs:

{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

As the output shows, it pretty-prints cleanly and in sorted key order, which suits manual inspection. So if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution.

Other alternatives, for contrast:

dict.setdefault

Although the asker thinks this isn't clean, I find it preferable to the Vividict myself.

d = {} # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number

And now:

>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

A misspelling would fail noisily, and not clutter our data with bad information:

>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'

Additionally, I think setdefault works great in loops where you don't know in advance which keys you're going to get, but repetitive usage becomes quite burdensome, and I don't think anyone would want to keep up the following:

d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})
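When the chains get this long, one option is to wrap the repeated setdefault calls in a small function. This is a sketch only; set_path is a hypothetical helper name, and it assumes every intermediate value along the path is a dict.

```python
def set_path(d, keys, value):
    """Store value under a sequence of nested keys, creating plain dicts as needed."""
    *parents, last = keys
    for key in parents:
        # setdefault returns the existing sub-dict or inserts a new one.
        d = d.setdefault(key, {})
    d[last] = value


d = {}
set_path(d, ['foo', 'bar'], 1)
set_path(d, ['foo', 'baz'], 2)
set_path(d, ['primary', 'secondary', 'tertiary', 'quaternary'], 3)
```

The result is an ordinary dict, so a misspelled lookup still raises KeyError.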

Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances; for example, it reuses the location in memory:

>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)
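The criticism itself is easy to reproduce: the default argument is evaluated before setdefault is called, so it is built even when the key already exists. In this sketch, make_default and the created list are hypothetical and exist only to count constructions.

```python
created = []


def make_default():
    """Build a fresh dict and record that it was built (illustration only)."""
    created.append(1)
    return {}


d = {}
d.setdefault('foo', make_default())  # key missing: the new dict is stored
d.setdefault('foo', make_default())  # key present: a dict is built, then discarded
constructions = len(created)
```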

An auto-vivified defaultdict

This is a neat-looking implementation, and in a script where you're not inspecting the data it would be just as useful as implementing __missing__:

from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

But if you need to inspect your data, the result of an auto-vivified defaultdict populated with data in the same way looks like this:

>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})

This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.
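One possible take on that exercise is sketched below. The function name to_plain_dict is hypothetical, and the sketch assumes the structure contains only dicts and leaf values; nested defaultdicts inside lists or other containers would be left untouched.

```python
from collections import defaultdict


def vivdict():
    return defaultdict(vivdict)


def to_plain_dict(obj):
    """Recursively copy dict-like levels into plain dicts; leaves are returned as-is."""
    if isinstance(obj, dict):  # defaultdict is a dict subclass
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj


d = vivdict()
d['foo']['bar']
d['primary']['secondary']['tertiary'] = 1
plain = to_plain_dict(d)
```

Passing plain to pprint.pprint then prints like any ordinary nested dict.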

Performance

Finally, let's look at performance. I'm subtracting the costs of instantiation.

>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747

Based on performance, dict.setdefault works the best. I'd highly recommend it for production code, in cases where you care about execution speed.
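The absolute numbers above depend on the machine and the Python version, so it is worth re-measuring before relying on them. The following is a sketch of a reusable harness built on the same subtraction idea; net_cost and its number and repeat parameters are hypothetical names, and the iteration count is reduced from timeit's default so the script finishes quickly.

```python
import timeit
from collections import defaultdict


class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value


def vivdict():
    return defaultdict(vivdict)


def net_cost(access, create, number=100_000, repeat=3):
    """Best time of create-plus-access minus best time of create alone, in seconds."""
    best_access = min(timeit.repeat(access, number=number, repeat=repeat))
    best_create = min(timeit.repeat(create, number=number, repeat=repeat))
    return best_access - best_create


results = {
    'setdefault': net_cost(lambda: {}.setdefault('foo', {}), lambda: {}),
    'defaultdict': net_cost(lambda: vivdict()['foo'], lambda: vivdict()),
    'Vividict': net_cost(lambda: Vividict()['foo'], lambda: Vividict()),
}

for name, seconds in results.items():
    print(f'{name}: {seconds:.4f}')
```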

If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn't really matter, in which case I'd go with Vividict for readability of the output. It is far superior to the AutoVivification object, which overrides __getitem__ instead of __missing__, the hook that was made for this purpose.

Conclusion

Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of

  • easy instantiation
  • easy data population
  • easy data viewing

and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.

Nevertheless, it has drawbacks:

  • Bad lookups will fail silently.
  • The bad lookup will remain in the dictionary.

Thus I personally prefer setdefault to the other solutions, and have in every situation where I have needed this sort of behavior.
