pandas.io.json.json_normalize与非常嵌套的json [英] pandas.io.json.json_normalize with very nested json

查看:196
本文介绍了pandas.io.json.json_normalize与非常嵌套的json的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在尝试normalize一个非常嵌套的json文件,稍后将对其进行分析.我正在努力的是如何进行一个以上的深度标准化处理.

I have been trying to normalize a very nested json file I will later analyze. What I am struggling with is how to go more than one level deep to normalize.

我经历了 pandas.io.json.json_normalize 文档,因为它确实可以实现我想要的功能.

I went through the pandas.io.json.json_normalize documentation, since it does exactly what I want it to do.

我已经能够对其中的一部分进行规范化,并且现在了解字典的工作原理,但是我仍然没有.

I have been able to normalize part of it and now understand how dictionaries work, but I am still not there.

使用下面的代码,我只能获得第一级.

With below code I am able to get only the first level.

import json
import pandas as pd
from pandas.io.json import json_normalize

with open('authors_sample.json') as f:
    d = json.load(f)

raw = json_normalize(d['hits']['hits'])

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = '_source', 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

我正在尝试使用以下代码挖掘"到作者"字典,但是record_path = ['_source', 'authors']抛出我TypeError: string indices must be integers.据我了解json_normalize的逻辑应该很好,但是我仍然不太了解如何使用dict vs list深入研究json.

I am trying to 'dig' into the 'authors' dictionary with below code, but the record_path = ['_source', 'authors'] throws me TypeError: string indices must be integers. As far as I understand json_normalize the logic should be good, but I still don't quite understand how to dive into a json with dict vs list.

我什至经历了这个简单的示例.

I even went through this simple example.

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = ['_source', 'authors'], 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

下面是json文件的一大块(5条记录).

Below is a chunk of the json file (5 records).

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'7CB3F2AD',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': None,
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Physical Review Letters',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'phys rev lett',
     u'pages': None,
     u'parent_keywords': [u'Chromatography',
      u'Quantum mechanics',
      u'Particle physics',
      u'Quantum field theory',
      u'Analytical chemistry',
      u'Quantum chromodynamics',
      u'Physics',
      u'Mass spectrometry',
      u'Chemistry'],
     u'pub_date': u'1987-03-02 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'mass spectra', u'elementary particles', u'bound states'],
     u'title': u'Evidence for a new meson: A quasinuclear NN-bar bound state',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Physical Review Letters',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7AF8EBC3',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'affiliations': [u'Punjabi University'],
       u'author_id': u'780E3459',
       u'author_name': u'munish puri'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'48D92C79',
       u'author_name': u'rajesh dhaliwal'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'7D9BD37C',
       u'author_name': u'r s singh'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Journal of Industrial Microbiology & Biotechnology',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'j ind microbiol biotechnol',
     u'pages': None,
     u'parent_keywords': [u'Nuclear medicine',
      u'Psychology',
      u'Hydrology',
      u'Chromatography',
      u'X-ray crystallography',
      u'Nuclear fusion',
      u'Medicine',
      u'Fluid dynamics',
      u'Thermodynamics',
      u'Physics',
      u'Gas chromatography',
      u'Radiobiology',
      u'Engineering',
      u'Organic chemistry',
      u'High-performance liquid chromatography',
      u'Chemistry',
      u'Organic synthesis',
      u'Psychotherapist'],
     u'pub_date': u'2008-04-04 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'flow rate',
      u'operant conditioning',
      u'packed bed reactor',
      u'immobilized enzyme',
      u'specific activity'],
     u'title': u'Development of a stable continuous flow immobilized enzyme reactor for the hydrolysis of inulin',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Journal of Industrial Microbiology & Biotechnology',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7521A721',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7FF872BC',
       u'author_name': u'barbara eileen ryan'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Social science',
      u'Politics',
      u'Sociology',
      u'Law'],
     u'pub_date': u'1992-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'social movements'],
     u'title': u"Feminism and the women's movement : dynamics of change in social movement ideology, and activism",
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7DAEB9A4',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'0299B8E9',
       u'author_name': u'fraser j harbutt'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Superconductivity',
      u'Nuclear fusion',
      u'Geology',
      u'Chemistry',
      u'Metallurgy'],
     u'pub_date': u'1988-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'iron'],
     u'title': u'The iron curtain : Churchill, America, and the origins of the Cold War',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7B3236C5',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7DAB7B72',
       u'author_name': u'richard m freeland'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Political Science', u'Economics'],
     u'pub_date': u'1985-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'foreign policy'],
     u'title': u'The Truman Doctrine and the origins of McCarthyism : foreign policy, domestic politics, and internal security, 1946-1948',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'}],
  u'max_score': 1.0,
  u'total': 36429433},
 u'timed_out': False,
 u'took': 170}

推荐答案

在下面的熊猫示例中,方括号是什么意思?有没有遵循[]的逻辑.

In the pandas example (below) what do the brackets mean? Is there a logic to be followed to go deeper with the [].

['state', 'shortname', ['info', 'governor']]中的每个元素是指向除选定行之外要包含的元素的路径. 'counties'参数设置应生成哪些行,第二个参数添加将包含在这些行中的元数据.

Each element in the ['state', 'shortname', ['info', 'governor']] is a path to an element to include, in addition to the selected rows. The 'counties' argument set what rows should be produced, and that second argument adds metadata that will be included with those rows.

每个都是路径,列表是一个嵌套结构.在示例输出中,您会在stateshortnameinfo.governor列中看到相应的值.

Each is path, a list is a nested structure. In the example output you see the corresponding values in the state, shortname and info.governor columns.

在示例JSON中,很少有嵌套列表可以使用第一个参数进行提升,就像示例中的'counties'一样.数据结构中的唯一示例是嵌套的'authors'键.您必须提取每个['_source', 'authors']路径,然后可以从父对象中添加其他键以增加这些行:

In your example JSON, there are few nested lists to elevate with the first argument, like 'counties' did in the example. The only example in that datastructure is the nested 'authors' key; you'd have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows:

>>> json_normalize(raw, [['_source', 'authors']], ['_id', ['_source', 'journal'], ['_source', 'title']])
                      affiliations author_id          author_name       _id  \
0                              NaN  166468F4  a bowdoin van riper  7FDFEB02
1                              NaN  81070854   jeffrey h schwartz  7FDFEB02
2  [Pennsylvania State University]  7E15BDFA       roger l geiger  7538108B

                  _source.journal  \
0  The American Historical Review
1  The American Historical Review
2  The American Historical Review

                                       _source.title
0  Men Among the Mammoths: Victorian Science and ...
1  Men Among the Mammoths: Victorian Science and ...
2  Elizabeth Popp Berman. Creating the Market Uni...

这是作者的数据框,其中为每个作者添加了元数据(_id值,期刊名称和文章标题).

So this is a dataframe of authors, with added metadata for each author (_id value, journal name and article title).

注意第一个参数的路径;如果要列出嵌套路径,则需要提供路径列表(即使它只是一个路径);只是['_source', 'authors']会查找两个行源,每个源都有一个简单的顶级名称.

Note the path for the first argument; if you want to list a nested path you need to provide a list of paths (even if it is just one path); just ['_source', 'authors'] would look for two row sources, each a simple top-level name.

然后,第二个参数从最外层对象中拉出_id键,但是标题和日记名称是list路径,因为它们也是嵌套的.

The second argument then pulls in the _id key from the outermost object, but the title and journal name are list paths, as these are nested too.

这篇关于pandas.io.json.json_normalize与非常嵌套的json的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆