How to read a large JSON file in pandas?


Question

My code is: data_review = pd.read_json('review.json'). The review data looks like this:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here to practice: link

When I use the following code, it throws the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)

Recommended answer

Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Both calls, as written, are meant to read a file that holds exactly one JSON document.

From what I have seen of the Yelp dataset, your file most likely contains something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

Hence, it is important to realize that this is not a single JSON document; it is multiple JSON objects in one file, one per line.
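To make the difference concrete, here is a minimal sketch using two made-up records (the field values are placeholders, not taken from the real dataset). It shows that parsing the whole text as one document fails, while parsing each line on its own succeeds:

```python
import json

# Two hypothetical records, one JSON object per line.
sample = (
    '{"review_id": "a1", "stars": 5}\n'
    '{"review_id": "b2", "stars": 3}\n'
)

# Parsing the whole text as a single JSON document fails,
# because there is more data after the first object ends.
try:
    json.loads(sample)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg

# Parsing each non-empty line separately succeeds.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_file_error)
print(records)
```

If your file behaves like the first case, it is in this one-object-per-line layout and needs to be parsed line by line.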

To read this data into a pandas data frame, the following solution should work:

import json
import pandas as pd

with open('review.json') as json_file:
    # Each line holds one JSON object, so parse the file line by line.
    # This step may take at least 8-10 minutes for 4-5 million rows.
    data = [json.loads(line) for line in json_file]

# Build the data frame from the list of parsed dicts.
data_review = pd.DataFrame(data)

Since the data is pretty large, expect your machine to take a considerable amount of time to load it into the data frame.
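As an alternative that may be easier on memory, pandas can parse this one-object-per-line layout itself through the lines=True and chunksize parameters of pd.read_json. The sketch below is only an illustration: the helper name read_json_lines and the default chunk size are my own choices, and the in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def read_json_lines(path_or_buf, chunksize=100_000):
    """Read a one-object-per-line JSON source in chunks into one DataFrame.

    read_json_lines is a hypothetical helper; the chunk size is an
    arbitrary starting point and should be tuned to the machine.
    """
    # With lines=True and chunksize set, read_json returns an iterator
    # that yields one DataFrame per chunk instead of parsing everything
    # in a single pass.
    reader = pd.read_json(path_or_buf, lines=True, chunksize=chunksize)
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Small in-memory stand-in for review.json (made-up records).
sample = io.StringIO(
    '{"review_id": "a1", "stars": 5, "text": "good"}\n'
    '{"review_id": "b2", "stars": 3, "text": "fine"}\n'
    '{"review_id": "c3", "stars": 4, "text": "nice"}\n'
)
df = read_json_lines(sample, chunksize=2)
print(df)
```

For the real file you would call read_json_lines('review.json'). If the full data frame still does not fit in memory, each chunk can be filtered or reduced to the columns you need before it is appended to the list.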
