Optimize parsing a file of JSON objects into a pandas dataframe, where keys may be missing in some rows


Problem Description

I'm looking to optimize the code below, which takes about 5 seconds to run; that is too slow for a file of only 1000 lines.

I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):

  {"location": {"town": "Rome", "groupe": "Advanced",
                "school": {"SchoolGroupe": "TrowMet", "SchoolName": "VeronM"}},
   "id": "145",
   "Mother": {"MotherName": "Helen", "MotherAge": "46"},
   "NGlobalNote": 2,
   "Father": {"FatherName": "Peter", "FatherAge": "51"},
   "Teacher": ["MrCrock", "MrDaniel"],
   "Field": "Marketing",
   "season": ["summer", "spring"]}

I need to parse this file and extract only a few key-values from every JSON object, to obtain the resulting dataframe:

Groupe      Id   MotherName   FatherName
Advanced    56   Laure        James
Middle      11   Ann           Nicolas
Advanced    6    Helen         Franc

But some of the keys I need in the dataframe are missing from some JSON objects, so I have to verify that each key is present and, if not, fill the corresponding value with NaN. I currently use the following method:

import json
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for line in f:
        jfile = json.loads(line)

        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan

        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan

        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan

        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan

        # appending one row at a time rebuilds the frame on every iteration
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)

I need to optimize the runtime over the whole 1000-line file to <= 2 seconds. In Perl the same parsing function takes < 1 second, but I need to implement it in Python.

Recommended Answer

You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which returns a default value when the key isn't found. I created an empty dict called dummy to pass as the default for the intermediate gets, so that a chained get always works.
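As a minimal sketch of that chained-get pattern (the record below is hypothetical, not from the dataset), a key missing at either level falls through to the default instead of raising KeyError:

```python
import numpy as np

dummy = {}
# hypothetical record: 'groupe' is missing and the whole 'Mother' dict is absent
record = {"location": {"town": "Rome"}, "id": "7"}

# the outer get returns dummy when 'Mother' is absent,
# so the inner get still runs and returns the default
groupe = record.get('location', dummy).get('groupe', np.nan)
mother = record.get('Mother', dummy).get('MotherName', np.nan)
```

Both lookups end up as NaN here, while a present key like 'id' is returned as usual.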

I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """ convert 1 json dict to records for import"""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan), 
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
    columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time()-start)

#
# The original way
#

start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for line in f:
        jfile = json.loads(line)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
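Not part of the original answer, but worth a sketch: pandas also ships json_normalize, which flattens nested dicts into dotted column names, and a reindex can guarantee all four columns exist even when a key is absent from every record (the sample lines and column names below assume the JSON shape shown in the question):

```python
import json
import pandas as pd

# two hypothetical input lines; the second is missing groupe, Mother and Father
lines = [
    '{"location": {"groupe": "Advanced"}, "id": "145", "Mother": {"MotherName": "Helen"}}',
    '{"location": {"town": "Rome"}, "id": "146"}',
]

records = [json.loads(line) for line in lines]
# flatten nested dicts, then reindex so columns that never appeared
# (here Father.FatherName) are created and filled with NaN
df = pd.json_normalize(records).reindex(
    columns=['location.groupe', 'id', 'Mother.MotherName', 'Father.FatherName'])
```

Whether this beats the generator-of-tuples approach depends on the data; for deeply nested records it mostly saves hand-written get chains rather than time.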

