JSON-encoding very long iterators


Question

I am writing a web service that returns objects containing very long lists, encoded as JSON. Naturally we want to use iterators rather than Python lists so that we can stream the objects from a database; unfortunately, the JSON encoder in the standard library (json.JSONEncoder) only accepts lists and tuples for conversion to JSON lists (though its internal _iterencode_list looks like it would actually work on any iterable).
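For illustration, a minimal reproduction of the limitation (the rows generator below is a hypothetical stand-in for a database cursor):

import json

def rows():
    # stand-in for a database cursor that yields records lazily
    for i in range(3):
        yield {"id": i}

print(json.dumps(list(rows())))  # works, but materializes everything first
try:
    json.dumps(rows())  # the stdlib encoder rejects a bare iterator
except TypeError as e:
    print(e)  # Object of type generator is not JSON serializable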

The docstrings suggest overriding default() to convert the object to a list, but that means we lose the benefit of streaming. Previously we overrode a private method, but (as could have been expected) that broke when the encoder was refactored.
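That suggestion amounts to something like the following sketch (ListifyEncoder is an illustrative name, not a stdlib class). The output is correct, but the entire iterator is materialized in memory before any encoding happens:

import collections.abc
import json

class ListifyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, collections.abc.Iterable):
            return list(o)  # correct output, but no streaming
        return super().default(o)

print(json.dumps({"a": (i for i in range(5))}, cls=ListifyEncoder))
# {"a": [0, 1, 2, 3, 4]}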

What is the best way to serialize iterators as JSON lists in Python, in a streaming fashion?

Answer

I needed exactly this. My first approach was to override the JSONEncoder.iterencode() method. However, that does not work: as soon as the iterator is not at the top level, the internals of the private _iterencode() function take over.
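A rough sketch of what that first attempt looks like, and where it breaks down (TopLevelOnlyEncoder is an illustrative name):

import json

class TopLevelOnlyEncoder(json.JSONEncoder):
    # special-case iterators, but only at the top level
    def iterencode(self, o, _one_shot=False):
        if hasattr(o, "__iter__") and not isinstance(o, (list, tuple, dict, str)):
            yield "["
            for i, item in enumerate(o):
                if i:
                    yield ", "
                yield from super().iterencode(item)
            yield "]"
        else:
            yield from super().iterencode(o, _one_shot)

enc = TopLevelOnlyEncoder()
print("".join(enc.iterencode(i for i in range(3))))  # [0, 1, 2] -- fine
try:
    # a nested iterator never reaches our override; the base class's
    # internal _iterencode function handles values inside containers
    print("".join(enc.iterencode({"a": (i for i in range(3))})))
except TypeError as e:
    print(e)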

After some study of the code I found a very hacky solution, but it works. Python 3 only, though I'm sure the same magic is possible in Python 2 (just with other magic-method names):

import collections.abc
import json
import itertools
import sys
import resource
import time
starttime = time.time()
lasttime = None


def log_memory():
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    if "linux" in sys.platform.lower():
        to_MB = 1024
    else:
        to_MB = 1024 * 1024
    print("Memory: %.1f MB, time since start: %.1f sec%s" % (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / to_MB,
        time.time() - starttime,
        "; since last call: %.1f sec" % (time.time() - lasttime) if lasttime
        else "",
    ))
    globals()["lasttime"] = time.time()


class IterEncoder(json.JSONEncoder):
    """
    JSON encoder that encodes iterators as well.
    Write directly to a file to keep memory use minimal.
    """
    class FakeListIterator(list):
        # Wrap an iterator in a list subclass so that the encoder's
        # isinstance(x, (list, tuple)) check passes without materializing it.
        def __init__(self, iterable):
            self.iterable = iter(iterable)
            try:
                # fetch the first item eagerly so __bool__ can report emptiness
                self.firstitem = next(self.iterable)
                self.truthy = True
            except StopIteration:
                self.truthy = False

        def __iter__(self):
            if not self.truthy:
                return iter([])
            return itertools.chain([self.firstitem], self.iterable)

        def __len__(self):
            raise NotImplementedError("Fakelist has no length")

        def __getitem__(self, i):
            raise NotImplementedError("Fakelist has no getitem")

        def __setitem__(self, i, value):
            raise NotImplementedError("Fakelist has no setitem")

        def __bool__(self):
            return self.truthy

    def default(self, o):
        # only objects the encoder cannot handle natively reach default(),
        # so dicts, strings and real lists never get wrapped
        if isinstance(o, collections.abc.Iterable):
            return type(self).FakeListIterator(o)
        return super().default(o)

print(json.dumps((i for i in range(10)), cls=IterEncoder))
print(json.dumps((i for i in range(0)), cls=IterEncoder))
print(json.dumps({"a": (i for i in range(10))}, cls=IterEncoder))
print(json.dumps({"a": (i for i in range(0))}, cls=IterEncoder))


log_memory()
print("dumping 10M numbers as incrementally")
with open("/dev/null", "wt") as fp:
    json.dump(range(10000000), fp, cls=IterEncoder)
log_memory()
print("dumping 10M numbers built in encoder")
with open("/dev/null", "wt") as fp:
    json.dump(list(range(10000000)), fp)
log_memory()

Result:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
{"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
{"a": []}
Memory: 8.4 MB, time since start: 0.0 sec
dumping 10M numbers incrementally
Memory: 9.0 MB, time since start: 8.6 sec; since last call: 8.6 sec
dumping 10M numbers with the built-in encoder
Memory: 395.5 MB, time since start: 17.1 sec; since last call: 8.5 sec

It is clear that IterEncoder does not need the memory to hold the 10M ints, while encoding at the same speed.

The (hacky) trick is that _iterencode_list doesn't actually need any of the list machinery. It only wants to know whether the list is empty (via __bool__) and then obtain its iterator; however, it only reaches that code when isinstance(x, (list, tuple)) returns True. So I package the iterator into a list subclass, disable all random access, fetch the first element up front so I know whether it is empty, and feed back the iterator. The default() method then returns this fake list whenever it encounters an iterator.
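For a web service you would typically consume iterencode() directly and push the chunks into the response; note that only json.dump() and iterencode() actually stream, while json.dumps() has to build the whole string in memory before returning it. A minimal sketch, assuming the IterEncoder class from above (stream_json and its write parameter are illustrative names):

import io

def stream_json(obj, write):
    # write: any callable accepting str chunks (a file's write method,
    # a WSGI write callable, a socket wrapper, ...)
    for chunk in IterEncoder().iterencode(obj):
        write(chunk)

buf = io.StringIO()
stream_json({"a": (i for i in range(5))}, buf.write)
print(buf.getvalue())  # {"a": [0, 1, 2, 3, 4]}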

