Normalize JSON using Python
Problem description
I am relatively new to JSON and Python, and I have been struggling to flatten JSON for the last two days.
I read the example at http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.io.json.json_normalize.html, but I didn't understand how to unlist some nested elements. I also read a few threads ("Flatten JSON based on an attribute - python" and "How to normalize complex nested json in python?") and https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10. I tried them all without any luck.
Here's the first record of my JSON file:
d = {'city': {'url': 'link',
              'name': ['San Francisco']},
'rank': 1,
'resident': [
{'link': ['bit.ly/0842/'], 'name': ['John A']},
{'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
{'link': ['bit.ly/2011/'], 'name': ['Cobb C']},
{'link': ['bit.ly/0855/'], 'name': ['Jack N']},
{'link': ['bit.ly/1430/'], 'name': ['Jack K']},
{'link': ['bit.ly/3081/'], 'name': ['Edward']},
{'link': ['bit.ly/2001/'], 'name': ['Jack W']},
{'link': ['bit.ly/0020/'], 'name': ['Henry F']},
{'link': ['bit.ly/2137/'], 'name': ['Joseph S']},
{'link': ['bit.ly/3225/'], 'name': ['Ed B']},
{'link': ['bit.ly/3667/'], 'name': ['George Vvec']},
{'link': ['bit.ly/6434/'], 'name': ['Robert W']},
{'link': ['bit.ly/4036/'], 'name': ['Rudy B']},
{'link': ['bit.ly/6450/'], 'name': ['James K']},
{'link': ['bit.ly/5180/'], 'name': ['Billy N']},
{'link': ['bit.ly/7847/'], 'name': ['John S']}]
}
Here's the expected output:
city_url city_name rank resident_link resident_name
link San Francisco 1 'bit.ly/0842/' 'John A'
link San Francisco 1 'bit.ly/5835/' 'Tedd B'
link San Francisco 1 'bit.ly/2011/' 'Cobb C'
link San Francisco 1 'bit.ly/0855/' 'Jack N'
link San Francisco 1 'bit.ly/1430/' 'Jack K'
link San Francisco 1 'bit.ly/3081/' 'Edward'
link San Francisco 1 'bit.ly/2001/' 'Jack W'
link San Francisco 1 'bit.ly/0020/' 'Henry F'
link San Francisco 1 'bit.ly/2137/' 'Joseph S'
link San Francisco 1 'bit.ly/3225/' 'Ed B'
link San Francisco 1 'bit.ly/3667/' 'George Vvec'
link San Francisco 1 'bit.ly/6434/' 'Robert W'
link San Francisco 1 'bit.ly/4036/' 'Rudy B'
link San Francisco 1 'bit.ly/6450/' 'James K'
link San Francisco 1 'bit.ly/5180/' 'Billy N'
link San Francisco 1 'bit.ly/7847/' 'John S'
The flatten_json() function (from the Medium.com article above) destroys the hierarchy. Here are the first few entries:
{'city_url': 'link',
'city_name_0': 'San Francisco',
'rank': 1,
'resident_0_link_0': 'bit.ly/0842/',
'resident_0_name_0': 'John A', ...
Can someone please help me think about how to convert these datasets? Unfortunately, the pandas documentation provides no guidance for beginners. Here's what I was playing with. Nothing worked.
from pandas.io.json import json_normalize
json_normalize(d,['city',['name','rank']])
json_normalize(d,['city','name','rank'])
json_normalize(d,['city','name'])
I'd appreciate it if someone could explain how to do this type of conversion, and the thought process behind it.
Also, I'm looking for a vectorized operation or an O(N) operation rather than O(N^2) because of the amount of data in the original dataset. Hence, anything slower than O(N) won't work.
If you know the structure of the JSON blob, this will do it:
import pandas

# Each resident's 'link' and 'name' is a one-element list, so take element 0
resident_link = [k['link'][0] for k in d['resident']]
resident_name = [k['name'][0] for k in d['resident']]
n = len(d['resident'])
city_url = n * [d['city']['url']]
city_name = n * [d['city']['name'][0]]
rank = n * [d['rank']]
df = pandas.DataFrame({
'resident_name' : resident_name,
'resident_link' : resident_link,
'city_url' : city_url,
'city_name' : city_name,
'rank' : rank
})
Which produces
city_name city_url rank resident_link resident_name
0 San Francisco link 1 bit.ly/0842/ John A
1 San Francisco link 1 bit.ly/5835/ Tedd B
2 San Francisco link 1 bit.ly/2011/ Cobb C
3 San Francisco link 1 bit.ly/0855/ Jack N
4 San Francisco link 1 bit.ly/1430/ Jack K
5 San Francisco link 1 bit.ly/3081/ Edward
6 San Francisco link 1 bit.ly/2001/ Jack W
7 San Francisco link 1 bit.ly/0020/ Henry F
8 San Francisco link 1 bit.ly/2137/ Joseph S
9 San Francisco link 1 bit.ly/3225/ Ed B
10 San Francisco link 1 bit.ly/3667/ George Vvec
11 San Francisco link 1 bit.ly/6434/ Robert W
12 San Francisco link 1 bit.ly/4036/ Rudy B
13 San Francisco link 1 bit.ly/6450/ James K
14 San Francisco link 1 bit.ly/5180/ Billy N
15 San Francisco link 1 bit.ly/7847/ John S
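For comparison, here is a minimal sketch of the same reshaping with pandas' own json_normalize, which the question was trying to use. It assumes pandas 1.0 or later, where the function is available as pandas.json_normalize, and it uses a shortened two-resident copy of the record. The helper first_if_list is a hypothetical name introduced here to unwrap the one-element lists. record_path selects the list that becomes the rows; meta lists the parent fields to repeat on every row.

```python
import pandas as pd

# Shortened copy of the record from the question (two residents only).
d = {
    'city': {'url': 'link', 'name': ['San Francisco']},
    'rank': 1,
    'resident': [
        {'link': ['bit.ly/0842/'], 'name': ['John A']},
        {'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
    ],
}

def first_if_list(value):
    """Unwrap non-empty lists such as ['John A']; pass other values through."""
    return value[0] if isinstance(value, list) and value else value

df = pd.json_normalize(
    d,
    record_path='resident',                            # one output row per resident
    meta=[['city', 'url'], ['city', 'name'], 'rank'],  # parent fields repeated per row
    record_prefix='resident_',                         # link -> resident_link, etc.
    sep='_',                                           # ['city', 'url'] -> city_url
)

# The leaf values are still one-element lists at this point; unwrap them.
df = df.apply(lambda col: col.map(first_if_list))
print(df)
```

The attempts in the question failed because they pointed record_path at city, which is a dict of scalar fields rather than the list of rows; the list to expand here is resident.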
EDIT
As the OP says in the comments, imagine there are many records like this, each with the same structure:
nrecords = 10
dd = {k : d for k in range(nrecords)}
dd now has 10 copies of the original JSON blob. This is how the code should be updated:
frames = []
for record in range(nrecords):
    rec = dd[record]
    n = len(rec['resident'])
    df = {
        'resident_link' : [k['link'][0] for k in rec['resident']],
        'resident_name' : [k['name'][0] for k in rec['resident']],
        'city_url' : n * [rec['city']['url']],
        'city_name' : n * [rec['city']['name'][0]],
        'rank' : n * [rec['rank']]
    }
    frames.append(pandas.DataFrame(df))
# DataFrame.append was removed in pandas 2.0; concatenate once after the loop
ff = pandas.concat(frames, ignore_index=True)
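The loop above builds one DataFrame per record. As a sketch of an alternative that visits every resident exactly once and creates no intermediate frames, the rows can be collected as plain dicts first. The function name iter_rows and the shortened two-resident record are assumptions made for illustration, and the fixed structure from the question is assumed throughout.

```python
def iter_rows(records):
    """Yield one flat dict per resident, repeating the parent city fields."""
    for rec in records:
        city = rec['city']
        for res in rec['resident']:
            yield {
                'city_url': city['url'],
                'city_name': city['name'][0],      # unwrap one-element list
                'rank': rec['rank'],
                'resident_link': res['link'][0],   # unwrap one-element list
                'resident_name': res['name'][0],
            }

# Shortened copy of the record from the question (two residents only).
d = {
    'city': {'url': 'link', 'name': ['San Francisco']},
    'rank': 1,
    'resident': [
        {'link': ['bit.ly/0842/'], 'name': ['John A']},
        {'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
    ],
}

records = [d] * 3                 # three records sharing the same structure
rows = list(iter_rows(records))   # 3 records x 2 residents = 6 rows

# With pandas installed, the frame can then be built in a single call:
# df = pandas.DataFrame(rows)
```

The work is proportional to the total number of residents, which is the O(N) behaviour the question asks for.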
Based on an estimate of running time as a function of the number of records, it will take around 1 h to complete 1.5 million records.
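An estimate of this kind can be produced by timing a few input sizes and extrapolating. The sketch below is a hypothetical harness, not the measurement behind the figure quoted above: it times a pure-Python flattening step (the names flatten and time_flatten are made up here) on a shortened two-resident record, and it assumes the running time grows linearly with the number of records.

```python
import time

def flatten(records):
    """Return one tuple per resident, repeating the parent city fields."""
    return [
        (rec['city']['url'], rec['city']['name'][0], rec['rank'],
         res['link'][0], res['name'][0])
        for rec in records
        for res in rec['resident']
    ]

def time_flatten(n_records, template):
    """Time flatten() on n_records references to the same template record."""
    records = [template] * n_records
    start = time.perf_counter()
    flatten(records)
    return time.perf_counter() - start

# Shortened copy of the record from the question (two residents only).
template = {
    'city': {'url': 'link', 'name': ['San Francisco']},
    'rank': 1,
    'resident': [
        {'link': ['bit.ly/0842/'], 'name': ['John A']},
        {'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
    ],
}

timings = {n: time_flatten(n, template) for n in (1_000, 10_000, 100_000)}
for n, seconds in timings.items():
    print(f"{n:>7} records: {seconds:.4f} s")

# Linear extrapolation from the largest measured size to 1.5 million records.
largest = max(timings)
estimate = timings[largest] * 1_500_000 / largest
print(f"estimated for 1,500,000 records: {estimate:.2f} s")
```

If the measured times grow faster than the record count, the linear assumption does not hold and the extrapolation will underestimate the real cost.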