lxml parser eats all memory


Problem description

I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all of the server's memory), even though I only have 100 async greenlets and each of them parses documents of at most 300 KB.
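
For reference, a minimal sketch of the kind of setup described above (the fetch() helper, the URL list and the pool size are illustrative assumptions, not the original code):

import gevent
from gevent.pool import Pool
import lxml.html

def fetch(url):
    # Placeholder for the real HTTP fetch; in the real spider this returns
    # an HTML document of up to ~300 KB.
    return "<html><body>%s</body></html>" % url

def parse(url):
    htmltext = fetch(url)
    doc = lxml.html.fromstring(htmltext)   # the call that appears to leak
    return doc.text_content()

pool = Pool(100)                            # 100 concurrent greenlets
urls = ["http://example.com/%d" % i for i in range(1000)]
results = pool.map(parse, urls)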

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it in isolation.

The problem is with this line:

HTML = lxml.html.fromstring(htmltext)

Does anyone know what the cause might be, or how to fix this?

Thanks for your help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
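
For illustration, a minimal sketch of applying the same soft virtual-memory limit from inside the worker process with the standard resource module (an assumption about how the limit might be applied programmatically, not code from the original post):

import resource

# "ulimit -Sv 500000" caps virtual memory at 500000 KiB; RLIMIT_AS takes bytes.
soft_limit_bytes = 500000 * 1024
_, hard_limit = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (soft_limit_bytes, hard_limit))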

Now, after some time, they start writing to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"

I can't catch this exception, so it keeps writing this message to the log recursively until there is no free space left on disk.

How can I catch this exception so I can kill the process and let the daemon create a new one?

Recommended answer

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings that provide access to their containing element, and thus keep the whole tree in memory if you hold a reference to them. See the docs on XPath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.
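
For example, a small sketch of the difference (the HTML snippet and variable names are illustrative; the smart_strings keyword is documented for etree.XPath):

import lxml.html
from lxml import etree

htmltext = "<html><body><a href='/next'>next</a></body></html>"  # illustrative input
root = lxml.html.fromstring(htmltext)

# Default behaviour: the result is a "smart" string that knows its parent
# element, so keeping it around keeps the whole parsed tree in memory.
smart = root.xpath("//a/@href")[0]
print(smart.getparent().tag)        # 'a' -- the tree is still reachable

# With smart_strings=False the results are plain strings, so the tree can be
# garbage-collected as soon as nothing else references its elements.
find_hrefs = etree.XPath("//a/@href", smart_strings=False)
plain = find_hrefs(root)[0]
print(hasattr(plain, "getparent"))  # False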

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))

