使用pd.json_normalize展平字典 [英] Flattening dictionary with pd.json_normalize

查看:115
本文介绍了使用pd.json_normalize展平字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在整理该词典文件,并且遇到了很多障碍.我正在尝试使用 json_normalize 来展平此数据.如果我使用单个实例进行测试,则可以正常工作,但是如果我要整理所有数据,它将返回一个错误,指出 key错误'0',我不确定如何解决此问题.

I am currently working on flattening this dictionary file and have reached a number of road blocks. I am trying to use json_normalize to flatten this data. If I test with individual instances it works but if I want to flatten all the data it will return an error stating key error '0' I'm not sure how to fix this.

数据示例-

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543']}}

我的代码

d_flat = pd.io.json.json_normalize(data, meta=['Title', 'StoreID', 'BookID', 'SalesID'])

推荐答案

设置

您的数据结构不便.我想专注于:

Setup

Your data is structured inconveniently. I want to focus on:

  1. "ID" 中的列表放入字典列表中,这会更加方便.
  2. 摆脱父词典中无用的键.我们只关心价值.
  1. Getting the lists in 'IDs' into a list of dictionaries, which would be far more convenient.
  2. Getting rid of the useless keys in the parent dictionary. All we care about are the values.

您的数据:

{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}

我希望它看起来像什么

[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111',
           'BookID': '543533254353',
           'SalesID': '543267765345'},
          {'StoreID': '1121111',
           'BookID': '4324232342',
           'SalesID': '4353543'}]}]


重组数据

合理的方式

简单的循环,不要乱扔.这使我们得到了上面显示的内容


Restructure Data

Reasonable Way

Simple loop, don't mess around. This gets us what I showed above

new = []

for v in data.values():
    temp = {**v}           # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)

可爱的单缸纸

new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]

使用 pd.json_normalize

创建 DataFrame

在对 json_normalize 的调用中,我们需要指定记录的路径,即在'IDs'键处找到的ID词典列表. json_normalize 将在数据框中为该列表中的每个项目创建一行.这将通过 record_path 参数完成,我们传递一个 tuple 来描述路径(如果是更深层的结构)或字符串(如果键位于顶层,对我们来说就是.)

Create DataFrame with pd.json_normalize

In this call to json_normalize we need to specify the path to the records, i.e. the list of id dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This will be done with the the record_path parameter and we pass a tuple that describes the path (if it were in a deeper structure) or a string (if the key is at the top layer, which for us, it is).

record_path = 'IDs'

然后我们要告诉 json_normalize 哪些键是记录的元数据.如果像我们一样有多个记录,那么将为每条记录重复元数据.

Then we want to tell json_normalize what keys are metadata for the records. If there are more than one record, as we have, then the metadata will be repeated for each record.

meta = 'Name'

所以最终的解决方案是这样的:

So the final solution looks like this:

pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers


但是

如果我们仍然要进行重组,那么最好还是将其传递给数据框构造函数.


However

If we are restructuring anyway, might as well make it so we can just pass it to the dataframe constructor.

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543


奖励内容

虽然我们在做.关于每个id类型是否具有相同数量的id,数据是模棱两可的.假设他们没有.


Bonus Content

While we are at it. The data is ambiguous in regards to whether or not each id type has the same number of ids. Suppose they did not.

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']}},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543', 'extra id']}}}

然后我们可以使用 itertools

from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id

这篇关于使用pd.json_normalize展平字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆