使用 pandas 解析从CSV加载的JSON字符串 [英] Parsing a JSON string which was loaded from a CSV using Pandas

查看:104
本文介绍了使用 pandas 解析从CSV加载的JSON字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理CSV文件,其中几个列具有一个简单的json对象(几个键值对),而其他列则是正常的.这是一个示例:

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:

name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"

使用df = pandas.read_csv('file.csv')后,解析和将stats列拆分为其他列的最有效方法是什么?

After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?

大约一个小时后,我唯一能想到的是:

After about an hour, the only thing I could come up with was:

import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))

这似乎是我做错了,考虑到我需要定期在三列上进行,这需要大量的工作.

This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.

所需的输出是下面的数据框对象.添加了以下几行代码以我的(糟糕的)方式到达那里:

Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:

df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df

Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94

推荐答案

有一种稍微简单的方法,但最终您必须调用json.loads.pandas.read_csv中有一个转换器的概念

There is a slightly easier way, but ultimately you'll have to call json.loads There is a notion of a converter in pandas.read_csv

converters : dict. optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels

因此,首先定义您的自定义解析器.在这种情况下,下面的方法应该起作用:

So first define your custom parser. In this case the below should work:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

在您的情况下,您将得到类似的东西:

In your case you'll have something like:

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)

我们告诉read_csv以标准方式读取数据,但对于stats列,请使用我们的自定义解析器.这会使stats列成为 dict

We are telling read_csv to read the data in the standard way, but for the stats column use our custom parsers. This will make the stats column a dict

从这里开始,我们可以使用一些技巧直接将这些列与适当的列名一起追加.这仅适用于常规数据(json对象需要具有3个值,或者至少需要在我们的CustomParser中处理缺少的值)

From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the json object needs to have 3 values or at least missing values need to be handled in our CustomParser)

df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)

在左侧,我们从stats列的元素键中获得新的列名. stats列中的每个元素都是一个字典.因此,我们正在进行批量分配.在右侧,我们使用apply分解"stats"列,以从每个键/值对中制作一个数据框.

On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.

这篇关于使用 pandas 解析从CSV加载的JSON字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆