Is there an up-to-date fast YAML parser with python bindings?


Question


What's the latest and greatest for fast YAML parsing in Python? Syck is out of date and recommends using PyYaml, yet PyYaml is pretty slow, and suffers from the GIL problem:

>>> import time, yaml
>>> def xit(f, x):
        import threading
        for i in xrange(x):
                threading.Thread(target=f).start()

>>> def stressit():
        start = time.time()
        res = yaml.load(open(path_to_11000_byte_yaml_file))
        print "Took %.2fs" % (time.time() - start,)    

>>> xit(stressit, 1)
Took 0.37s
>>> xit(stressit, 2)
Took 1.40s
Took 1.41s
>>> xit(stressit, 4)
Took 2.98s
Took 2.98s
Took 2.99s
Took 3.00s

Given my use case I can cache the parsed objects, but I'd still prefer a faster solution even for that.
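
(For reference, a minimal sketch of what such caching might look like, keyed on the file's modification time; the `.pickle` cache path is illustrative and not part of the original question:)

import os
import pickle
import yaml

def load_cached(path):
    # Reparse the YAML only when the file has changed since the
    # cached pickle was written; otherwise load the pickle directly.
    cache = path + ".pickle"  # hypothetical cache location
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache, "rb") as f:
            return pickle.load(f)
    with open(path) as f:
        data = yaml.load(f)
    with open(cache, "wb") as f:
        pickle.dump(data, f)
    return data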

Solution

The wiki page linked in the question states, right after the warning, "Use libyaml (c), and PyYaml (python)". The note does have a bad wikilink, though (it should be PyYAML, not PyYaml).

As for performance, depending on how you installed PyYAML you should have the CParser class available, which implements a YAML parser written in optimized C. While I don't think this gets around the GIL issue, it is markedly faster. Here are a few cursory benchmarks I ran on my machine (AMD Athlon II X4 640, 3.0GHz, 8GB RAM):

First with the default pure-Python parser:

$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y)'                    
10 loops, best of 3: 405 msec per loop

With the CParser:

$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y, Loader=yaml.CLoader)'
10 loops, best of 3: 59.2 msec per loop

And, for comparison, with PyPy using the pure-Python parser.

$ pypy -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y)'
10 loops, best of 3: 101 msec per loop
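
In code, picking the C-backed loader when it is available is usually done with a guarded import; here is a minimal sketch (the file name is reused from the benchmarks above):

import yaml

# CLoader only exists when PyYAML was compiled against libyaml,
# so guard the import and fall back to the pure-Python Loader.
try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

with open("large.yaml") as f:
    data = yaml.load(f, Loader=Loader)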

For large.yaml I just googled for "large yaml file" and came across this:

https://gist.github.com/nrh/667383/raw/1b3ba75c939f2886f63291528df89418621548fd/large.yaml

(I had to remove the first couple of lines to make it a single-doc YAML file otherwise yaml.load complains.)
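
As an aside, if you need to keep a multi-document file intact rather than trimming it, PyYAML's load_all iterates over each document in the stream instead of complaining; a minimal sketch (assuming CLoader is available):

import yaml

with open("large.yaml") as f:
    # load_all yields one parsed object per YAML document in the stream.
    for doc in yaml.load_all(f, Loader=yaml.CLoader):
        print(type(doc))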

EDIT:

Another thing to consider is using the multiprocessing module instead of threads. This gets around the GIL problem, but does require a bit more boilerplate code to communicate between the processes. There are a number of good libraries available, though, that make multiprocessing easier. There's a pretty good list of them here.
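
For illustration, a minimal sketch of that approach (the input file names are hypothetical): each worker process parses independently, so the GIL no longer serializes the work, at the cost of pickling results back to the parent.

import multiprocessing
import yaml

def parse(path):
    # Runs in a worker process; the result is pickled back to the
    # parent, so it must be a picklable object.
    with open(path) as f:
        return yaml.load(f)

if __name__ == "__main__":
    paths = ["a.yaml", "b.yaml", "c.yaml", "d.yaml"]  # hypothetical inputs
    pool = multiprocessing.Pool()
    docs = pool.map(parse, paths)
    pool.close()
    pool.join()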
