Merge Many json strings with python pandas inputs
Problem Description
I have created data objects that are composed of (among other things) pandas objects like DataFrames and Panels. I'm looking to serialize these objects into json, and speed is a primary consideration.
Say, for instance, I have a panel like so:
In [54]: panel = pandas.Panel(
    numpy.random.randn(5, 100, 10),
    items = ['a', 'b', 'c', 'd', 'e'],
    major_axis = pandas.DatetimeIndex(start = '01/01/2000',
                                      freq = 'b',
                                      periods = 100
                                      ),
    minor_axis = ['z', 'y', 'x', 'v', 'u', 't', 's', 'r', 'q', 'o']
)
In [64]: panel
Out[64]:
<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 100 (major_axis) x 10 (minor_axis)
Items axis: a to e
Major_axis axis: 2000-01-03 00:00:00 to 2000-05-19 00:00:00
Minor_axis axis: z to o
And I'd like to turn this panel into flattened json.
NOTE: I'm doing this with more complicated objects, but the overall logic of looping over keys and generating json data for each key is the same.
I can write a quick and dirty panel_to_json() function like so:
import json

def panel_to_json(panel):
    d = {'__type__' : 'panel'}
    for item in panel.items:
        tmp = panel.loc[item ,: , :].to_json()
        # eval turns the json text back into a Python dict; this is the slow step
        d[item] = eval(tmp)
    return json.dumps(d)
In [58]: tmp = panel_to_json(panel)
In [59]: tmp[:100]
Out[59]: '{"a": {"q": {"948931200000": -0.5586319118, "951955200000": 0.6820748888, "949363200000": -0.0153867'
This gets me the correct result, but the eval usage is very costly. For example, if I remove the eval and just deal with the smattering of \\ that results, as in the panel_no_eval_to_json function here:
def panel_no_eval_to_json(panel):
    d = {'__type__' : 'panel'}
    for item in panel.items:
        d[item] = panel.loc[item ,: , :].to_json()
    return json.dumps(d)
In [60]: tmp = panel_no_eval_to_json(panel)
In [61]: tmp[:100]
Out[61]: '{"a": "{\\"z\\":{\\"946857600000\\":1.0233515965,\\"946944000000\\":-1.1333560575,\\"947030400000\\":-0.0072'
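The backslashes in the output above appear because to_json() already returns a json string, and json.dumps then encodes that string a second time. The following is a minimal sketch of the effect using only the standard library; the sample values are made up, and json.dumps stands in for the to_json() call. It also shows json.loads as a safer way than eval to parse the text, since eval would fail on json literals such as null or true.

```python
import json

# Stand-in for the string a to_json() call would return (made-up values).
inner = json.dumps({"z": {"946857600000": 1.02}})

# Embedding the string directly: json.dumps escapes its quotes,
# so the value under "a" is a string, not a nested object.
double_encoded = json.dumps({"a": inner})

# Parsing it first keeps the structure nested.
nested = json.dumps({"a": json.loads(inner)})

print(double_encoded)
print(nested)
```

Decoding double_encoded gives a plain string under "a", so a consumer would have to decode twice; decoding nested gives the nested dictionary directly.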
The difference in speed is substantial; check out their %timeit values:
In [62]: %timeit panel_no_eval_to_json(panel)
100 loops, best of 3: 3.55 ms per loop
In [63]: %timeit panel_to_json(panel)
10 loops, best of 3: 41.1 ms per loop
End Goal
So my final goal would be to loop through the Panel (or my object, which has different keys / attributes, many of which are Panels and DataFrames), and merge the json streams created by invoking to_json() into an aggregated json stream (which would actually be the flattened data representation of my data object), just as the panel_to_json function above (the one with eval) does.
My main goals are:
- Leverage existing pandas to_json functionality
- Leverage speedups and existing libraries (I could write my own json_stream_merger, but clearly this has already been done, right?)
Recommended Answer
In the end, the fastest way was to write a simple string concat-er. Here are the two best solutions (one provided by @Skorp) and their respective %timeit times in graphical form.
Method 1. String-Merge
import logging

def panel_to_json_string(panel):
    def __merge_stream(key, stream):
        return '"' + key + '"' + ': ' + stream + ', '
    try:
        stream = '{ "__type__": "panel", '
        for item in panel.items:
            stream += __merge_stream(item, panel.loc[item, :, :].to_json())
        # take out extra last comma
        stream = stream[:-2]
        # add the final brace
        stream += '}'
        return stream
    except Exception:
        # on failure, log and return None rather than a partial json string
        logging.exception('Panel Encoding did not work')
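The string-concatenation idea above can also be written so that it does not depend on Panel at all. The sketch below is an illustration, not the answer author's code: merge_json_streams and its type_name parameter are hypothetical names, and it assumes each value is already a valid json string (for example, the result of a to_json() call). It collects the pieces in a list and joins them once, and uses json.dumps to quote the keys.

```python
import json

def merge_json_streams(streams, type_name="panel"):
    """Merge already-serialized json values into one json object string.

    `streams` maps each key to a json string; the values are not re-parsed.
    (Hypothetical helper, written for illustration.)
    """
    parts = ['"__type__": ' + json.dumps(type_name)]
    for key, stream in streams.items():
        # json.dumps quotes and escapes the key; the value is inserted as-is.
        parts.append(json.dumps(str(key)) + ": " + stream)
    return "{" + ", ".join(parts) + "}"

merged = merge_json_streams({"a": '{"z": 1.5}', "b": '{"y": [1, 2]}'})
print(merged)
```

Because the values are spliced in without parsing, the result is only valid json if every input string is; that trade-off is what makes the approach fast.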
Method 2. Loads-Dumps
import json
import logging

def panel_to_json_loads(panel):
    try:
        d = {'__type__' : 'panel'}
        for item in panel.items:
            # json.loads parses the json text without the cost (or risk) of eval
            d[item] = json.loads(panel.loc[item ,: , :].to_json())
        return json.dumps(d)
    except Exception:
        logging.exception('Panel Encoding did not work')
Problem Setup
import timeit
import pandas
import numpy

# strat_check.io is the author's own module; this assumes both functions live there
setup = ("import strat_check.io as sio; import pandas; import numpy;"
         "panel = pandas.Panel(numpy.random.randn(5, {0}, 4), "
         "items = ['a', 'b', 'c', 'd', 'e'], "
         "major_axis = pandas.DatetimeIndex(start = '01/01/1990',"
         "freq = 's', "
         "periods = {0}), "
         "minor_axis = numpy.arange(4))")

vals = [10, 100, 1000, 10000, 100000]
d = {'string-merge': [],
     'loads-dumps': []
     }

for n in vals:
    number = 10
    d['string-merge'].append(
        # stmt runs in the namespace built by setup, so the sio. prefix is needed
        timeit.timeit(stmt = 'sio.panel_to_json_string(panel)',
                      setup = setup.format(n),
                      number = number)
    )
    d['loads-dumps'].append(
        timeit.timeit(stmt = 'sio.panel_to_json_loads(panel)',
                      setup = setup.format(n),
                      number = number)
    )
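Since the setup above depends on the author's own module, here is a rough, self-contained sketch of the same comparison that runs with the standard library alone. It is an assumption-laden stand-in, not the original benchmark: the payloads are synthetic json strings rather than to_json() output, and string_merge and loads_dumps are simplified hypothetical versions of the two methods.

```python
import json
import timeit

# Synthetic stand-ins for five to_json() results (made-up data).
payloads = {
    key: json.dumps({str(i): i * 0.5 for i in range(1000)})
    for key in "abcde"
}

def string_merge(streams):
    # Splice the already-serialized values together without parsing them.
    body = ", ".join(json.dumps(k) + ": " + v for k, v in streams.items())
    return "{" + body + "}"

def loads_dumps(streams):
    # Parse every value, then serialize the combined dictionary again.
    return json.dumps({k: json.loads(v) for k, v in streams.items()})

# Both approaches must describe the same object.
assert json.loads(string_merge(payloads)) == json.loads(loads_dumps(payloads))

timings = {
    name: timeit.timeit(lambda f=func: f(payloads), number=10)
    for name, func in [("string-merge", string_merge),
                       ("loads-dumps", loads_dumps)]
}
print(timings)
```

The absolute numbers will vary by machine and payload size; the point of the sketch is only the shape of the comparison.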
[Image: timing plot for the two methods, https://i.stack.imgur.com/37Q8r.png]