Extract nested JSON embedded as string in Pandas dataframe


Question


I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.

My CSV:

id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}


Note that not all keys are the same for all the rows. I'd like it to produce a data frame equivalent to this:

data = {'id'     : [1, 2, 3],
        'dist'  : [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color':["red", "blue", "yellow"],
        'characteristics_body_make':["sedan", "sedan", "NA"],
        'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)


(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )


After a lot of futzing around, I came up with the following solution:

#Imports
import json

import pandas as pd
from pandas.io.json import json_normalize

#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)


This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?


I've gone through the following related questions: Reading a CSV into pandas where one column is a json string (which seems to only work for simple, non-nested JSON)


JSON to pandas DataFrame (I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)


I'm using Python 3.3 and Pandas 0.17.

Answer


Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:

import csv
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists

Which produces:

>>> print(df)

   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan   
1  Tuesday                       blue                     sedan   
2      NaN                     yellow                       NaN   

   characteristics.manuf_year  loc.lat  loc.lon ids  
0                        2014    45.70     38.9   1  
1                        2014    46.89     36.7   2  
2                        2010    45.70     31.0   3
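One caveat worth noting: `csv.reader` hands every field back as a Python string, so the `ids` and `dists` columns above come out as object dtype rather than numbers. A minimal sketch of the fix (using the column names from the snippet above):

```python
import pandas as pd

# csv.reader yields strings, so cast the id/dist columns to numbers.
df = pd.DataFrame({'ids': ['1', '2', '3'], 'dists': ['67', '34', '98']})
df[['ids', 'dists']] = df[['ids', 'dists']].apply(pd.to_numeric)
```

After the cast, comparisons and arithmetic on those columns behave as expected.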


Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:

def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary,"""
    res = {}
    queue = [('', dct)]

    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))

    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
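As a quick sanity check, here is what `flatten` yields on the first sample row: every nested key becomes a single underscore-joined key. (The helper is repeated verbatim so this snippet runs on its own.)

```python
def flatten(dct, separator='_'):
    # Same helper as above, repeated so the snippet is self-contained.
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

row = {"loc": {"lat": 45.7, "lon": 38.9},
       "arrival": "Monday",
       "characteristics": {"body": {"color": "red", "make": "sedan"},
                           "manuf_year": 2014}}
print(flatten(row))
# keys: arrival, loc_lat, loc_lon, characteristics_body_color,
#       characteristics_body_make, characteristics_manuf_year
```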


Here are the results of a timing experiment, where I artificially increased the size of the sample data by duplicating rows. The number of rows is given by n_rows:

        method 1 (s)  method 2 (s)  original time (s)
n_rows                                               
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403


Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
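For readers on newer pandas (0.25+), `json_normalize` is available at the top level and accepts a `sep` argument, so the whole parse can be written without an explicit loop. A sketch, assuming the same `sample.csv` layout (inlined here via `StringIO` so it runs standalone):

```python
import io
import json

import pandas as pd

csv_text = """id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
"""

df_raw = pd.read_csv(io.StringIO(csv_text), delimiter='|')

# Normalize all rows in one call; sep='_' gives loc_lat-style columns.
parsed = pd.json_normalize(df_raw['json_request'].map(json.loads).tolist(),
                           sep='_')
df = pd.concat([df_raw.drop(columns='json_request'), parsed], axis=1)
```

This keeps the "normalize once over all rows" idea from the accepted answer but leans on the library for the flattening.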
