Create pandas dataframe from json objects

Question

I finally have output of data I need from a file with many json objects but I need some help with converting the below output into a single dataframe as it loops through the data. Here is the code to produce the output including a sample of what the output looks like:

Raw data:

{
"zipcode":"08989",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37,"wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
{
"zipcode":"08990",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37, "wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}

import glob
import itertools
import json

import pandas as pd


def lines_per_n(f, n):
    # Yield the file's contents n lines at a time, joined into one string.
    for line in f:
        yield ''.join(itertools.chain([line], itertools.islice(f, n - 1)))


for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 5):
            try:
                jfile = json.loads(chunk)
                zipcode = jfile['zipcode']
                datetime = jfile['current']['proc_time']
                triggers = jfile['triggers']
                print(zipcode, datetime, triggers)
            except ValueError:
                # Skip chunks that are not valid JSON.
                pass

Sample output I get when I run the above which I would like to store in a pandas dataframe as 3 columns.

08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
00544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48]

So the code below seems a lot closer, in that it gives me a funky df if I pass in the list and transpose the df. Any idea how I can get this reshaped properly?

def series_chunk(chunk):
    # Parse one JSON object and return the three fields of interest.
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return zipcode, datetime, triggers

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 7):
            df1 = pd.DataFrame(list(series_chunk(chunk)))
            print(df1.T)

[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]

DataFrame:

   0               1   2
0  08988  20131126102946  []
       0               1                                                  2
0  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
       0               1   2
0  08988  20131126102946  []
       0               1                                                  2
0  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...

Here is my final code and output. How do I capture each dataframe it creates through the loop and concatenate them on the fly as one dataframe object?

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        print(pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 7)], axis=1).T)

       0               1                                                  2
0  08988  20131126102946                                                 []
1  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
       0               1                                                  2
0  08988  20131126102946                                                 []
1  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...

Recommended answer

Note: For those of you arriving at this question looking to parse json into pandas, if you do have valid json (this question doesn't) then you should use pandas read_json function:

# can either pass string of the json, or a filepath to a file with valid json
In [99]: pd.read_json('[{"A": 1, "B": 2}, {"A": 3, "B": 4}]')
Out[99]:
   A  B
0  1  2
1  3  4
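
In recent pandas releases, passing a literal JSON string straight to read_json is deprecated, so a version-tolerant sketch wraps the string in io.StringIO. The data is the same two-record array as in the session above.

```python
from io import StringIO

import pandas as pd

# Same two-record JSON array as in the session above.
raw = '[{"A": 1, "B": 2}, {"A": 3, "B": 4}]'

# read_json accepts a path or a file-like object; StringIO turns the
# in-memory string into a file-like object.
df = pd.read_json(StringIO(raw))
print(df)
```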

Check out the IO part of the docs for several examples, arguments you can pass to this function, as well as ways to normalize less structured json.
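
For nested records like the ones in this question, pandas.json_normalize is one of the helpers available for flattening. A minimal sketch, using trimmed records shaped like the question's data (the field values are illustrative only):

```python
import pandas as pd

# Trimmed records shaped like the question's data; values are illustrative.
records = [
    {"zipcode": "08989", "current": {"hi": 37, "dewpt": 19}, "triggers": [53, 31]},
    {"zipcode": "08990", "current": {"hi": 37, "dewpt": 19}, "triggers": []},
]

# Nested dict keys become dotted column names such as "current.hi";
# list values are left as Python lists in a single column.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Whether flattening every nested key is desirable depends on how many of the fields under "current" are actually needed.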

If you don't have valid json, it's often efficient to munge the string before reading in as json, for example see this answer.
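
The file in this question holds several JSON objects written back to back, which is not a single valid JSON document. One way to handle that, sketched here with the standard library's json.JSONDecoder.raw_decode, is to decode one object at a time and collect the results. The helper name iter_json_objects and the sample text are assumptions made for illustration.

```python
import json

import pandas as pd


def iter_json_objects(text):
    """Yield each top-level JSON value from back-to-back JSON text."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two objects with no enclosing array or separator.
sample = '''
{"zipcode": "08989", "current": {"proc_time": "20131126102946"}, "triggers": [53, 31]}
{"zipcode": "08990", "current": {"proc_time": "20131126102946"}, "triggers": []}
'''

rows = [
    {"zipcode": o["zipcode"],
     "proc_time": o["current"]["proc_time"],
     "triggers": o["triggers"]}
    for o in iter_json_objects(sample)
]
df = pd.DataFrame(rows)
print(df)
```

Because the decoder tracks its own position, this approach does not depend on each object spanning a fixed number of lines.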

If you have several json files you should concat the DataFrames together (similar to in this answer):

pd.concat([pd.read_json(file) for file in ...], ignore_index=True)
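
A self-contained sketch of that pattern: it writes two small JSON files to a temporary directory (the file names and contents are made up for the demonstration), then reads and concatenates them.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create two throwaway JSON files; names and contents are illustrative.
    for name, payload in [("a.json", '[{"A": 1, "B": 2}]'),
                          ("b.json", '[{"A": 3, "B": 4}]')]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(payload)

    # sorted() makes the row order deterministic across platforms.
    paths = sorted(glob.glob(os.path.join(tmp, "*.json")))
    combined = pd.concat([pd.read_json(p) for p in paths], ignore_index=True)

print(combined)
```

Passing ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row labels.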

Original answer for this example:

Use a lookbehind in the regex for the separator passed to read_csv:

In [11]: df = pd.read_csv('foo.csv', sep=r'(?<!,)\s', header=None, engine='python')

In [12]: df
Out[12]: 
       0               1                                                  2
0   8988  20131126102946                                                 []
1   8989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
2   8988  20131126102946                                                 []
3   8989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
4    544  20131126102946  [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
5    601  20131126094911                                                 []
6    602  20131126101056                                                 []
7    603  20131126101056                                                 []
8    604  20131126101056                                                 []
9    544  20131126102946  [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
10   601  20131126094911                                                 []
11   602  20131126101056                                                 []
12   603  20131126101056                                                 []
13   604  20131126101056                                                 []

[14 rows x 3 columns]
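
A small reproduction of that call, reading from an in-memory string instead of foo.csv. It assumes the printed output has exactly the three whitespace-separated fields shown in the question. Two things here are additions rather than part of the original call: dtype=str, which keeps the leading zeros of the zip codes, and ast.literal_eval, which turns the bracketed text back into Python lists.

```python
import ast
from io import StringIO

import pandas as pd

# Two lines in the same layout as the question's printed output.
text = (
    "08988 20131126102946 []\n"
    "08989 20131126102946 [53, 31, 9, 21]\n"
)

# Split on whitespace that is NOT preceded by a comma, so the spaces
# inside the bracketed list do not start new columns. Regex separators
# need the python parsing engine.
df = pd.read_csv(StringIO(text), sep=r"(?<!,)\s", header=None,
                 engine="python", dtype=str)

# Column 2 is still text such as "[53, 31, 9, 21]"; parse it into lists.
df[2] = df[2].apply(ast.literal_eval)
print(df)
```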

As mentioned in the comments, you may be able to do this more directly by concatenating several Series together. It's also going to be a little easier to follow:

def series_chunk(chunk):
    # Parse one JSON object and return its three fields as a Series.
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return pd.Series([zipcode, datetime, triggers])

dfs = []
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        # axis=1 makes each Series a column; transpose so each record is a row.
        df = pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 5)], axis=1).T
        dfs.append(df)  # append the per-file frame, not the list itself

df = pd.concat(dfs, ignore_index=True)

Note: you could also move the try/except into series_chunk.
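
A sketch of that variant, assuming a chunk that fails to parse should simply be skipped. Here series_chunk returns None for such chunks and the caller filters them out. The in-memory chunks list is a stand-in for the output of lines_per_n, and catching KeyError as well as ValueError is an extra assumption about how missing fields should be treated.

```python
import json

import pandas as pd


def series_chunk(chunk):
    """Return the three fields as a Series, or None if the chunk is unusable."""
    try:
        jfile = json.loads(chunk)
        return pd.Series([jfile['zipcode'],
                          jfile['current']['proc_time'],
                          jfile['triggers']])
    except (ValueError, KeyError):
        # ValueError covers malformed JSON; KeyError covers missing fields.
        return None


# Stand-in for the chunks produced by lines_per_n; the second one is broken.
chunks = [
    '{"zipcode": "08988", "current": {"proc_time": "20131126102946"}, "triggers": []}',
    '{"zipcode": "08989", "current"',
    '{"zipcode": "08989", "current": {"proc_time": "20131126102946"}, "triggers": [53, 31]}',
]

series = [s for s in map(series_chunk, chunks) if s is not None]
# Each Series becomes a column; transpose so each record is a row.
df = pd.concat(series, axis=1).T
print(df)
```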
