Normalize JSON using Python


Problem description


I am relatively new to JSON and Python, and I have been struggling to flatten JSON for the last two days. I read the example at http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.io.json.json_normalize.html, but I didn't understand how to unlist some nested elements. I also read a few threads ("Flatten JSON based on an attribute - python" and "How to normalize complex nested json in python?") and https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10. I tried all of them without any luck.

Here's the first record of my JSON file:

d = {
 'city': {'url': 'link',
  'name': ['San Francisco']},
 'rank': 1,
 'resident': [
  {'link': ['bit.ly/0842/'], 'name': ['John A']},
  {'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
  {'link': ['bit.ly/2011/'], 'name': ['Cobb C']},
  {'link': ['bit.ly/0855/'], 'name': ['Jack N']},
  {'link': ['bit.ly/1430/'], 'name': ['Jack K']},
  {'link': ['bit.ly/3081/'], 'name': ['Edward']},
  {'link': ['bit.ly/2001/'], 'name': ['Jack W']},
  {'link': ['bit.ly/0020/'], 'name': ['Henry F']},
  {'link': ['bit.ly/2137/'], 'name': ['Joseph S']},
  {'link': ['bit.ly/3225/'], 'name': ['Ed B']},
  {'link': ['bit.ly/3667/'], 'name': ['George Vvec']},
  {'link': ['bit.ly/6434/'], 'name': ['Robert W']},
  {'link': ['bit.ly/4036/'], 'name': ['Rudy B']},
  {'link': ['bit.ly/6450/'], 'name': ['James K']},
  {'link': ['bit.ly/5180/'], 'name': ['Billy N']},
  {'link': ['bit.ly/7847/'], 'name': ['John S']}]
}

Here's the expected output:

city_url  city_name      rank    resident_link   resident_name  
link      San Francisco   1     'bit.ly/0842/'   'John A'
link      San Francisco   1     'bit.ly/5835/'   'Tedd B'
link      San Francisco   1     'bit.ly/2011/'   'Cobb C'
link      San Francisco   1     'bit.ly/0855/'   'Jack N'
link      San Francisco   1     'bit.ly/1430/'   'Jack K'
link      San Francisco   1     'bit.ly/3081/'   'Edward'
link      San Francisco   1     'bit.ly/2001/'   'Jack W'
link      San Francisco   1     'bit.ly/0020/'   'Henry F'
link      San Francisco   1     'bit.ly/2137/'   'Joseph S'
link      San Francisco   1     'bit.ly/3225/'   'Ed B'
link      San Francisco   1     'bit.ly/3667/'   'George Vvec'
link      San Francisco   1     'bit.ly/6434/'   'Robert W'
link      San Francisco   1     'bit.ly/4036/'   'Rudy B'
link      San Francisco   1     'bit.ly/6450/'   'James K'
link      San Francisco   1     'bit.ly/5180/'   'Billy N'
link      San Francisco   1     'bit.ly/7847/'   'John S'

The flatten_json() function (from the Towards Data Science article linked above) destroys the hierarchy. Here are the first few entries of its output:

{'city_url': 'link',
 'city_name_0': 'San Francisco',
 'rank': 1,
 'resident_0_link_0': 'bit.ly/0842/',
 'resident_0_name_0': 'John A', ...

Can someone please help me understand how to think about converting these datasets? Unfortunately, the pandas documentation provides no guidance for beginners. Here's what I was playing with; nothing worked.

from pandas.io.json import json_normalize
json_normalize(d,['city',['name','rank']])
json_normalize(d,['city','name','rank'])
json_normalize(d,['city','name'])

I'd appreciate it if someone could explain how to do this type of conversion, and the thought process behind it.

Also, I'm looking for a vectorized or O(N) operation rather than an O(N^2) one because of the amount of data in the original dataset. Anything slower than O(N) won't work.

Solution

If you know the structure of the JSON blob, this will do it:

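import pandas

# each resident's 'link' and 'name' are one-element lists, so take element [0]
resident_link = [k['link'][0] for k in d['resident']]
resident_name = [k['name'][0] for k in d['resident']]
# repeat the city-level fields once per resident so all columns have equal length
n = len(d['resident'])
city_url = n * [d['city']['url']]
city_name = n * [d['city']['name'][0]]
rank = n * [d['rank']]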

df = pandas.DataFrame({
    'resident_name' : resident_name,
    'resident_link' : resident_link,
    'city_url' : city_url,
    'city_name' : city_name,
    'rank' : rank
})

Which produces

        city_name city_url  rank resident_link resident_name
0   San Francisco     link     1  bit.ly/0842/        John A
1   San Francisco     link     1  bit.ly/5835/        Tedd B
2   San Francisco     link     1  bit.ly/2011/        Cobb C
3   San Francisco     link     1  bit.ly/0855/        Jack N
4   San Francisco     link     1  bit.ly/1430/        Jack K
5   San Francisco     link     1  bit.ly/3081/        Edward
6   San Francisco     link     1  bit.ly/2001/        Jack W
7   San Francisco     link     1  bit.ly/0020/       Henry F
8   San Francisco     link     1  bit.ly/2137/      Joseph S
9   San Francisco     link     1  bit.ly/3225/          Ed B
10  San Francisco     link     1  bit.ly/3667/   George Vvec
11  San Francisco     link     1  bit.ly/6434/      Robert W
12  San Francisco     link     1  bit.ly/4036/        Rudy B
13  San Francisco     link     1  bit.ly/6450/       James K
14  San Francisco     link     1  bit.ly/5180/       Billy N
15  San Francisco     link     1  bit.ly/7847/        John S
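
For comparison, the json_normalize route that the question attempted might look like the sketch below. It assumes pandas 1.0 or later (where the function is available as pandas.json_normalize) and that every 'link' and 'name' list holds exactly one item. The helper unwrap_singletons and the shortened two-resident record are made up for illustration; they are not part of the original answer.

```python
import pandas as pd

# Shortened copy of the question's record (two residents instead of sixteen).
record = {
    "city": {"url": "link", "name": ["San Francisco"]},
    "rank": 1,
    "resident": [
        {"link": ["bit.ly/0842/"], "name": ["John A"]},
        {"link": ["bit.ly/5835/"], "name": ["Tedd B"]},
    ],
}


def unwrap_singletons(rec):
    """Hypothetical helper: replace each one-element list with its only item."""
    return {
        "city": {"url": rec["city"]["url"], "name": rec["city"]["name"][0]},
        "rank": rec["rank"],
        "resident": [
            {"link": r["link"][0], "name": r["name"][0]} for r in rec["resident"]
        ],
    }


table = pd.json_normalize(
    unwrap_singletons(record),
    record_path="resident",  # one output row per entry of 'resident'
    meta=[["city", "url"], ["city", "name"], "rank"],  # repeated on every row
    record_prefix="resident_",  # prefixes the columns taken from 'resident'
    sep="_",  # joins nested meta paths, e.g. city + url -> city_url
)
print(table)
```

The idea is that record_path names the list that defines the rows, while meta lists the outer fields to copy onto each row. json_normalize also accepts a list of such records, so the same call should extend to many records.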


EDIT

As the OP says in the comments, imagine there are many records like this, each with the same structure:

nrecords = 10
dd = {k : d for k in range(nrecords)}

dd now holds 10 copies of the original JSON blob. This is how the code should be updated:

frames = []  # one DataFrame per record, concatenated once at the end

for record in range(nrecords):

    n = len(dd[record]['resident'])

    df = {
        'resident_link' : [k['link'][0] for k in dd[record]['resident']],
        'resident_name' : [k['name'][0] for k in dd[record]['resident']],
        'city_url' : n * [dd[record]['city']['url']],
        'city_name' : n * [dd[record]['city']['name'][0]],
        'rank' : n * [dd[record]['rank']]
        }

    frames.append(pandas.DataFrame(df))

# DataFrame.append was removed in pandas 2.0; a single concat also avoids
# re-copying the accumulated frame on every iteration
ff = pandas.concat(frames, ignore_index=True)

The answer's estimate of running time as a function of the number of records (the plot is not reproduced here) suggests that about 1.5 million records would take around 1 h to complete.
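
Since the question asks for something close to O(N), one alternative worth trying is to skip the per-record DataFrames entirely: walk the records once, emit one plain dict per resident, and build a single DataFrame at the end. The sketch below assumes every record has the structure shown in the question; iter_rows and the two-resident sample are hypothetical names and data used only for illustration, and no timing is claimed for it.

```python
import pandas as pd


def iter_rows(records):
    """Yield one flat dict per resident, repeating the city-level fields."""
    for rec in records:
        city_url = rec["city"]["url"]
        city_name = rec["city"]["name"][0]
        rank = rec["rank"]
        for res in rec["resident"]:
            yield {
                "city_url": city_url,
                "city_name": city_name,
                "rank": rank,
                "resident_link": res["link"][0],
                "resident_name": res["name"][0],
            }


# Shortened stand-in for the question's record, repeated three times.
sample = {
    "city": {"url": "link", "name": ["San Francisco"]},
    "rank": 1,
    "resident": [
        {"link": ["bit.ly/0842/"], "name": ["John A"]},
        {"link": ["bit.ly/5835/"], "name": ["Tedd B"]},
    ],
}
records = [sample] * 3

# Each resident is visited exactly once; the DataFrame is built in one call.
flat = pd.DataFrame(list(iter_rows(records)))
print(flat)
```

Because the loop only touches each resident once and never grows a DataFrame incrementally, the work should scale with the total number of residents.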
