List with many dictionaries VS dictionary with few lists?

Question

I am doing some exercises with datasets like so:

List with many dictionaries

users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"}
]

Dictionary with few lists

users2 = {
    "id": [0, 1, 2, 3, 4, 5, 6],
    "name": ["Ashley", "Ben", "Conrad", "Doug","Evin", "Florian", "Gerald"]
}

Pandas dataframes

import pandas as pd
pd_users = pd.DataFrame(users)
pd_users2 = pd.DataFrame(users2)
# Element-wise comparison; every cell is True when the two frames match.
print(pd_users == pd_users2)

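Questions:

  1. Should I structure datasets like users or like users2?
  2. Are there performance differences?
  3. Is one more readable than the other?
  4. Is there a standard I should follow?
  5. I usually convert these to pandas dataframes. When I do that, both versions are identical... right?
  6. The output is True for each element, so it doesn't matter which one I use if I work with pandas dataframes, right?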

Recommended answer

This relates to column-oriented versus row-oriented databases. Your first example is a row-oriented data structure, and the second is column-oriented. In the particular case of Python, the first could be made notably more efficient using __slots__, such that the dictionary of column names doesn't need to be duplicated for every row.
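A minimal sketch of the slots idea, assuming a hypothetical User class that is not part of the original question. Declaring __slots__ gives each instance fixed attribute slots instead of its own __dict__, so the field names are stored once on the class rather than once per row:

```python
class User:
    # __slots__ replaces the per-instance __dict__ with fixed attribute slots,
    # so the field names live on the class instead of being repeated per row.
    __slots__ = ("id", "name")

    def __init__(self, id, name):
        self.id = id
        self.name = name


# Shortened copy of the question's row oriented data.
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
]

# Still row oriented: one object per row, built from each dictionary.
slotted_users = [User(**row) for row in users]

first = slotted_users[0]
has_dict = hasattr(first, "__dict__")  # no per-row dictionary is created
```

The data stays row oriented; only the per-row storage changes. Whether the saving matters depends on how many rows there are.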

Which form works better depends a lot on what you do with the data; for instance, row-oriented is natural if you only ever access whole rows at a time. Column-oriented, meanwhile, makes much better use of caches when you're searching by a particular field (in Python, this benefit may be reduced by the heavy use of references; types like array can optimize that). Traditional row-oriented databases frequently use column-oriented sorted indices to speed up lookups, and knowing these techniques you can implement any combination using a key-value store.
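The access patterns described above can be sketched as follows. The data is a shortened copy of the question's, and the helper find_by_name and the name_index structure are hypothetical names introduced only for illustration; this is one possible arrangement, not a recommended design:

```python
import bisect
from array import array

users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
]
users2 = {
    "id": [0, 1, 2, 3],
    "name": ["Ashley", "Ben", "Conrad", "Doug"],
}

# Row oriented: a whole record comes back from a single index operation.
whole_row = users[2]

# Column oriented: a search by name scans only the "name" column.
position = users2["name"].index("Conrad")
conrad_id = users2["id"][position]

# array stores the integers as packed machine values instead of a list of
# references to separate int objects.
packed_ids = array("q", users2["id"])

# A sorted index over one column of the row oriented data: (name, position)
# pairs kept in name order and searched with bisect.
name_index = sorted((row["name"], pos) for pos, row in enumerate(users))
index_keys = [name for name, _ in name_index]


def find_by_name(name):
    """Return the row whose name matches, or None if it is absent."""
    i = bisect.bisect_left(index_keys, name)
    if i < len(index_keys) and index_keys[i] == name:
        return users[name_index[i][1]]
    return None
```

Here find_by_name("Doug") returns the full row while only ever comparing names, which is the same trade a row store makes when it adds a sorted index on one column.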

Pandas does convert both your examples to the same format, but the conversion itself is more expensive for the row-oriented structure, simply because every individual dictionary must be read. All of these costs may be marginal.
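That both layouts carry the same information can also be checked without pandas. The sketch below uses a shortened copy of the question's data and converts in each direction with plain Python; the names columns and rows are introduced only for this illustration. The rows-to-columns step visits every dictionary, which is the per-row work mentioned above:

```python
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
]
users2 = {
    "id": [0, 1, 2],
    "name": ["Ashley", "Ben", "Conrad"],
}

# Rows -> columns: every row dictionary is read once per field.
columns = {key: [row[key] for row in users] for key in users[0]}

# Columns -> rows: zip the column lists together, then pair each tuple of
# values with the column names.
rows = [dict(zip(users2, values)) for values in zip(*users2.values())]

same_as_columns = columns == users2
same_as_rows = rows == users
```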

There's a third option not evident in your example: in this case, you only have two columns, one of which is an integer ID in a contiguous range from 0. This ID can be stored implicitly as the order of the entries, meaning the entire structure would be found in the list you've called users2['name']; but notably, the entries are incomplete without their position. The list translates into rows using enumerate(). It is common for databases to have this special case also (for instance, the SQLite rowid).
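A short sketch of this third option, using the names from the question. The variable names and the rebuilt rows list are illustrative only:

```python
# Only the name column is stored; each entry's position doubles as its id.
names = ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]

# Looking up by id is plain indexing.
doug = names[3]

# enumerate() restores the implicit id when full rows are needed.
rows = [{"id": i, "name": name} for i, name in enumerate(names)]
```

This only works while the ids stay contiguous from 0; deleting or reordering an entry changes the id of every entry after it.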

In general, start with a data structure that keeps your code sensible, and optimize only when you know your use cases and have a measurable performance issue. Tools like Pandas probably mean most projects will function just fine without fine-tuning.
