Merge Many json strings with python pandas inputs

Problem Description

I have created data objects that are composed of (among other things) pandas objects like DataFrames and Panels. I'm looking to serialize these objects into json, and speed is a primary consideration.

Say, for instance, I have a panel like so:

In [54]: panel = pandas.Panel( 
             numpy.random.randn(5, 100, 10), 
             items = ['a', 'b', 'c', 'd', 'e'], 
             major_axis = pandas.DatetimeIndex(start = '01/01/2000', 
                                               freq = 'b', 
                                               periods = 100
             ), 
             minor_axis = ['z', 'y', 'x', 'v', 'u', 't', 's', 'r', 'q', 'o']
          )
In [64]: panel
Out[64]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 100 (major_axis) x 10 (minor_axis)
Items axis: a to e
Major_axis axis: 2000-01-03 00:00:00 to 2000-05-19 00:00:00
Minor_axis axis: z to o

I'd like to turn this panel into flattened json.

NOTE: I'm doing this with more complicated objects, but the overall logic of looping over keys and generating json data for each key is the same

I can write a quick and dirty panel_to_json() function like so:

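import json

def panel_to_json(panel):
    d = {'__type__' : 'panel'}
    for item in panel.items:
        # to_json() returns a json string for this item's DataFrame
        tmp = panel.loc[item ,: , :].to_json()
        # eval turns the string back into a dict (slow, see timings below)
        d[item] = eval(tmp)
    return json.dumps(d)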

In [58]: tmp = panel_to_json(panel)
In [59]: tmp[:100]
Out[59]: '{"a": {"q": {"948931200000": -0.5586319118, "951955200000": 0.6820748888, "949363200000": -0.0153867'
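Apart from speed, eval has a correctness problem here: to_json() writes missing values as the JSON literal null, which is not a Python name. The following is a minimal sketch of that failure mode, assuming a hand-written string (hypothetical sample data, not taken from the panel above) as a stand-in for what to_json() would return:

```python
import json

# Hypothetical stand-in for the string DataFrame.to_json() would return
# for a frame containing a missing value (serialized as JSON null).
frame_json = '{"z": {"946857600000": 1.0233515965, "946944000000": null}}'

# json.loads understands JSON literals and maps null to None.
parsed = json.loads(frame_json)

# eval treats the text as Python source, where `null` is an undefined name.
try:
    eval(frame_json)
    eval_error = None
except NameError as exc:
    eval_error = exc

print(parsed["z"]["946944000000"])  # the JSON null, parsed as Python None
print(type(eval_error).__name__)    # the exception type raised by eval
```

So even if the timing were acceptable, json.loads would likely be the safer way to parse each item.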

This gets me the correct result, but the eval call is very costly. For example, if I remove the eval and just live with the smattering of \\ that results, as in the panel_no_eval_to_json function here:

def panel_no_eval_to_json(panel):
    d = {'__type__' : 'panel'}
    for item in panel.items:
        d[item] = panel.loc[item ,: , :].to_json()
    return json.dumps(d)

In [60]: tmp = panel_no_eval_to_json(panel)

In [61]: tmp[:100]
Out[61]: '{"a": "{\\"z\\":{\\"946857600000\\":1.0233515965,\\"946944000000\\":-1.1333560575,\\"947030400000\\":-0.0072'
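The backslashes in the output above appear because each item is already a json string when json.dumps sees it, so it is encoded a second time as a JSON string value rather than as a nested object. A small sketch of the difference, using only the standard library and a hypothetical one-entry string in place of real to_json() output:

```python
import json

# Hypothetical stand-in for one item's to_json() output.
item_json = '{"z": {"946857600000": 1.0233515965}}'

# Storing the string itself: json.dumps encodes it as a JSON *string*,
# escaping every inner quote.
double_encoded = json.dumps({"a": item_json})

# Storing the parsed object: json.dumps emits a nested JSON object.
nested = json.dumps({"a": json.loads(item_json)})

print(double_encoded)
print(nested)
```

A consumer of the double-encoded form has to call the json parser twice (once for the outer document, once per item), which is why it is faster to produce but not equivalent to the flattened result.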

The difference in speed is substantial; check out their %timeit values:

In [62]: %timeit panel_no_eval_to_json(panel)
100 loops, best of 3: 3.55 ms per loop

In [63]: %timeit panel_to_json(panel)
10 loops, best of 3: 41.1 ms per loop



End Goal

So my final goal is to loop through the Panel (or my object, which has different keys / attributes, many of which are Panels and DataFrames) and merge the json streams created by invoking to_json() into one aggregated json stream (which would be the flattened data representation of my data object), just as the panel_to_json function above (the one with eval) does.

My main goals are:

  1. Leverage existing pandas to_json functionality
  2. Leverage speedups and existing libraries (I could write my own json_stream_merger, but clearly this has already been done, right?)
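The kind of json_stream_merger mentioned in the goals can be sketched in a few lines. This is only an illustration under stated assumptions: the function name merge_json_objects and its type_tag parameter are hypothetical, the input strings stand in for to_json() output, and each value is trusted to be valid JSON already (it is not re-parsed):

```python
import json

def merge_json_objects(parts, type_tag=None):
    """Join already-serialized JSON values into one JSON object string.

    `parts` maps key -> JSON text. The values are trusted to be valid JSON
    and are not re-parsed; only the keys are encoded here.
    """
    members = []
    if type_tag is not None:
        members.append('"__type__": ' + json.dumps(type_tag))
    for key, value_json in parts.items():
        # json.dumps on the key handles quoting and escaping.
        members.append(json.dumps(str(key)) + ": " + value_json)
    return "{" + ", ".join(members) + "}"

# Hypothetical per-item strings, shaped like to_json() output.
parts = {
    "a": '{"z": {"946857600000": 1.5}}',
    "b": '{"z": {"946857600000": -0.25}}',
}
merged = merge_json_objects(parts, type_tag="panel")
print(merged)
```

Collecting the members in a list and joining once avoids trimming a trailing comma, and it also produces a valid "{}" when there is nothing to merge.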


Recommended Answer

In the end, the fastest way was to write a simple string concat-er. Here are the two best solutions (one provided by @Skorp) and their respective %timeit times in graphical form.

Method 1. String-Merge

import logging

def panel_to_json_string(panel):
    def __merge_stream(key, stream):
        # key is assumed to be a plain string that needs no json escaping
        return '"' + key + '"' + ': ' + stream + ', '

    try:
        stream = '{ "__type__": "panel", '
        for item in panel.items:
            stream += __merge_stream(item, panel.loc[item, :, :].to_json())

        # take out extra last comma
        stream = stream[:-2]

        # add the final closing brace
        stream += '}'
    except Exception:
        logging.exception('Panel Encoding did not work')
    return stream



Method 2. Loads-Dumps

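import json
import logging

def panel_to_json_loads(panel):
    try:
        d = {'__type__' : 'panel'}

        for item in panel.items:
            # parse each item's json so json.dumps nests it as an object
            d[item] = json.loads(panel.loc[item ,: , :].to_json())
        return json.dumps(d)
    except Exception:
        # on failure the error is logged and None is returned
        logging.exception('Panel Encoding did not work')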


Problem Setup

import timeit
import pandas
import numpy

# strat_check.io is the author's own module; both functions are assumed
# to live there, since the setup string imports nothing else that defines them
setup = ("import strat_check.io as sio; import pandas; import numpy;" 
     "panel = pandas.Panel(numpy.random.randn(5, {0}, 4), "
     "items = ['a', 'b', 'c', 'd', 'e'], " 
     "major_axis = pandas.DatetimeIndex(start = '01/01/1990',"
                                        "freq = 's', "
                                        "periods = {0}), "
                                        "minor_axis = numpy.arange(4))")

vals = [10, 100, 1000, 10000, 100000]

d = {'string-merge': [], 
     'loads-dumps': []
     }

# time both functions at every panel size
for n in vals:
    number = 10

    d['string-merge'].append(
        timeit.timeit(stmt = 'sio.panel_to_json_string(panel)', 
                      setup = setup.format(n), 
                      number = number)
    )

    d['loads-dumps'].append(
        timeit.timeit(stmt = 'sio.panel_to_json_loads(panel)', 
                      setup = setup.format(n), 
                      number = number)
    )
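Because Panel is no longer available in current pandas releases, the comparison can also be approximated without pandas at all. The sketch below is not the author's benchmark: it times only the merging step on synthetic strings, and make_parts, string_merge, and loads_dumps are hypothetical helpers written for this illustration. It leaves out the cost of to_json() itself, which both methods pay equally.

```python
import json
import timeit

def make_parts(n_rows, n_items=5):
    """Build hypothetical per-item JSON strings shaped like to_json() output."""
    column = {str(946684800000 + 1000 * i): i * 0.5 for i in range(n_rows)}
    text = json.dumps({"0": column})
    return {chr(ord("a") + k): text for k in range(n_items)}

def string_merge(parts):
    # Concatenate the pre-serialized values; assumes parts is non-empty.
    body = ", ".join(json.dumps(k) + ": " + v for k, v in parts.items())
    return '{"__type__": "panel", ' + body + "}"

def loads_dumps(parts):
    # Parse every value, then serialize the whole dict again.
    d = {"__type__": "panel"}
    for k, v in parts.items():
        d[k] = json.loads(v)
    return json.dumps(d)

results = {"string-merge": [], "loads-dumps": []}
for n in [10, 100, 1000]:
    parts = make_parts(n)
    results["string-merge"].append(
        timeit.timeit(lambda: string_merge(parts), number=10))
    results["loads-dumps"].append(
        timeit.timeit(lambda: loads_dumps(parts), number=10))

for name, times in results.items():
    print(name, times)
```

The absolute numbers will differ from the author's, since the inputs and machine differ; the sketch is only meant to show how the two strategies scale with the amount of JSON text.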

[Figure: %timeit results for string-merge and loads-dumps, https://i.stack.imgur.com/37Q8r.png]
