将多嵌套的dict/json加载到 pandas 中 [英] Load multi nested dict/json into pandas
问题描述
我正试图将一个令人困惑的多嵌套JSON
加载到熊猫中.我已经在使用 json_normalize ,但试图弄清楚我如何加入2个similair嵌套的dict
以及解包其子dict
和list
的方法一直困扰着我.我对大熊猫的了解有限,但我假设如果可以解决的话,我可以利用它的性能优势.
我有2个包含战争数据的字典,一个是从JSON API响应中加载的,另一个是在数据库中的.我正在尝试比较2个新的攻击和防御方式.
示例战争
{
"state": "active",
"team_size": 20,
"teams": {
"id": "12345679",
"name": "Good Guys",
"level": 10,
"attacks": 4,
"destruction_percentage": 22.6,
"members": [
{
"id": "1",
"name": "John",
"level": 12
},
{
"id": "2",
"name": "Tom",
"level": 11,
"attacks": [
{
"attackerTag": "2",
"defenderTag": "4",
"damage": 64,
"order": 7
}
]
}
]
},
"opponent": {
"id": "987654321",
"name": "Bad Guys",
"level": 17,
"attacks": 5,
"damage": 20.95,
"members": [
{
"id": "3",
"name": "Betty",
"level": 17,
"attacks": [
{
"attacker_id": "3",
"defender_id": "1",
"damage": 70,
"order": 1
},
{
"attacker_id": "3",
"defender_id": "7",
"damage": 100,
"order": 11
}
],
"opponentAttacks": 0,
"some_useless_data": "Want to ignore, this doesn't show in every record"
},
{
"id": "4",
"name": "Fred",
"level": 9,
"attacks": [
{
"attacker_id": "4",
"defender_id": "9",
"damage": 70,
"order": 4
}
],
"opponentAttacks": 0
}
]
}
}
现在,我认为就性能而言,熊猫将是我的最佳选择,而不是将它们拉在一起并遍历每个成员并进行比较.
所以我很难说出如何使dataframe
很好地展平并且易于遍历的尝试.最佳情况下,我假设采用以下布局.我只是想让两个团队都进入所有成员的单个df
中. 我们可以省略state
和team_size
键,而只关注获取每个成员及其各自的attacks
和team_id
的
示例 df
(预期结果):
id name level attacks member.team_id ...
1 John 12 NaN "123456789"
2 Tom 11 [{...}] "123456789"
3 Betty 17 [{...}, {...}] "987654321"
4 Fred 9 [{...}] "987654321"
这是我想要的df
的基本要点.这样,我就可以同时获取两个数据帧并进行比较,以进行新的攻击.</p>
注意 在尝试之前,我只是从字典中pop()
,state
和team_size
,因为我想要的是所有成员,并且团队几乎全部嵌入其中
我没有运气就尝试了以下方法,我知道这不是正确的方法,因为它在字典树上向后工作.
old_df = json_normalize(war,
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member')
#Traceback (most recent call last):
# File "test.py", line 83, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 79, in new_parse
# record_prefix='member')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
# _recursive_extract(data, record_path, {}, level=0)
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
# recs = _pull_field(obj, path[0])
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
# result = result[spec]
#KeyError: 'members'
我以为我可以使用类似以下内容的东西,但这也不起作用.
df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')
#Traceback (most recent call last):
# File "test.py", line 106, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 87, in new_parse
# df.droplevel('members')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
# return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'
我感谢任何指导!希望我投入了足够的精力来帮助理解我的预期结果,否则请告诉我!
修改 公平地讲,我确实知道如何做到这一点,只是循环字典并创建一个具有适当日期的新成员列表,但是我觉得这比使用熊猫要低得多,因为我正在为数以百万计的战争做这件事.线程化的应用程序以及我可以从中获得的每一个性能,对我和应用程序都是一个好处. -再次感谢!
我相信您可以使用:
need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']
df1 = json_normalize(war['teams'],
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member.')[need]
#print (df1)
df2 = json_normalize(war['opponent'],
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member.')[need]
#print (df2)
df1.columns = np.where(df1.columns.str.startswith('member.'),
df1.columns.str.split('.', n=1).str[1],
'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'),
df2.columns.str.split('.', n=1).str[1],
'member.' + df2.columns)
df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
id name level attacks \
0 1 John 12 NaN
1 2 Tom 11 [{'attackerTag': '2', 'defenderTag': '4', 'dam...
2 3 Betty 17 [{'attacker_id': '3', 'defender_id': '1', 'dam...
3 4 Fred 9 [{'attacker_id': '4', 'defender_id': '9', 'dam...
member.id
0 12345679
1 12345679
2 987654321
3 987654321
I'm trying to load a pretty confusing multi-nested JSON
into pandas. I'm already using json_normalize but trying to figure out how I'd join 2 similair nested dict
s as well as unpack their sub dict
s and list
s has been stumping me. I have limited understanding of pandas but I'm assuming I can leverage it's performance advantages if I can just get this down.
I have 2 dict's containing war data, one loaded from a JSON API response and one in a database. I'm trying to compare the 2 for new attacks and defenses.
example war
{
"state": "active",
"team_size": 20,
"teams": {
"id": "12345679",
"name": "Good Guys",
"level": 10,
"attacks": 4,
"destruction_percentage": 22.6,
"members": [
{
"id": "1",
"name": "John",
"level": 12
},
{
"id": "2",
"name": "Tom",
"level": 11,
"attacks": [
{
"attackerTag": "2",
"defenderTag": "4",
"damage": 64,
"order": 7
}
]
}
]
},
"opponent": {
"id": "987654321",
"name": "Bad Guys",
"level": 17,
"attacks": 5,
"damage": 20.95,
"members": [
{
"id": "3",
"name": "Betty",
"level": 17,
"attacks": [
{
"attacker_id": "3",
"defender_id": "1",
"damage": 70,
"order": 1
},
{
"attacker_id": "3",
"defender_id": "7",
"damage": 100,
"order": 11
}
],
"opponentAttacks": 0,
"some_useless_data": "Want to ignore, this doesn't show in every record"
},
{
"id": "4",
"name": "Fred",
"level": 9,
"attacks": [
{
"attacker_id": "4",
"defender_id": "9",
"damage": 70,
"order": 4
}
],
"opponentAttacks": 0
}
]
}
}
Now I'm assuming pandas would be my best option as far as performance goes rather than zipping them together and looping through each member and comparing them.
So my attempt to get a dataframe
that flattens nicely and easy to traverse have been difficult to say the least. Optimally I'd assume the following layout. I'm only trying to get both teams into a single df
of just all the members. we can omit the state
and team_size
keys and focus on just getting each member and their respective attacks
and team_id
's
example df
(expected, result):
id name level attacks member.team_id ...
1 John 12 NaN "123456789"
2 Tom 11 [{...}] "123456789"
3 Betty 17 [{...}, {...}] "987654321"
4 Fred 9 [{...}] "987654321"
That's the basic gist of what I'd like as a df
. So then I can take both dataframes and compare for new attacks.
Note I just pop()
'd state
and team_size
from the dict before my attempts as all I want is all members, with the team embedded into it pretty much
I have tried the following with no luck and I know it's not the correct way because it's working backwards up the dict tree.
old_df = json_normalize(war,
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member')
#Traceback (most recent call last):
# File "test.py", line 83, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 79, in new_parse
# record_prefix='member')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
# _recursive_extract(data, record_path, {}, level=0)
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
# recs = _pull_field(obj, path[0])
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
# result = result[spec]
#KeyError: 'members'
I assumed I could use something like the following but that isn't working either.
df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')
#Traceback (most recent call last):
# File "test.py", line 106, in <module>
# new_parse(old_war, new_war)
# File "test.py", line 87, in new_parse
# df.droplevel('members')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
# return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'
I appreciate any guidance! Hopefully I put enough in to help understand my intended result, if not please let me know!
Edit To be fair, I do know how to do this just looping the dict and creating a new list of members with the appropriate date, but this I feel is much more inefficient than using pandas as I'm doing this for millions of wars in a threaded application and every bit of performance I can get out of it is a bonus for me and the app. - Thanks again!
I believe you can use:
need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']
df1 = json_normalize(war['teams'],
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member.')[need]
#print (df1)
df2 = json_normalize(war['opponent'],
'members',
['id', 'name', 'level', 'attacks'],
record_prefix='member.')[need]
#print (df2)
df1.columns = np.where(df1.columns.str.startswith('member.'),
df1.columns.str.split('.', n=1).str[1],
'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'),
df2.columns.str.split('.', n=1).str[1],
'member.' + df2.columns)
df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
id name level attacks \
0 1 John 12 NaN
1 2 Tom 11 [{'attackerTag': '2', 'defenderTag': '4', 'dam...
2 3 Betty 17 [{'attacker_id': '3', 'defender_id': '1', 'dam...
3 4 Fred 9 [{'attacker_id': '4', 'defender_id': '9', 'dam...
member.id
0 12345679
1 12345679
2 987654321
3 987654321
这篇关于将多嵌套的dict/json加载到 pandas 中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!