lxml parser eats all memory

Question

I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. After it has been running for a while, the lxml parser starts eating memory, up to 8 GB (all of the server's memory). But I only have 100 async threads, and each of them parses a document of at most 300 KB.

I have tested this and found that the problem starts in lxml.html.fromstring, but I can't reproduce it.

The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what this could be, or how to fix it?

Thanks for any help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.

Now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

I can't catch this exception, so the message is written to the log over and over until the disk runs out of free space.

How can I catch this exception and kill the process, so that the daemon can start a new one?
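
One possible workaround, which is not part of the original answer, is to stop relying on catching the MemoryError at all: the process checks its own memory use after each parsed document and exits once it exceeds a budget, so the daemon can start a fresh one. The sketch below uses only the standard resource module. The names MAX_RSS_KIB, peak_rss_kib, over_budget and exit_if_over_budget are hypothetical, the budget simply mirrors the ulimit value above, and ru_maxrss is assumed to be reported in KiB, as it is on Linux.

```python
import os
import resource
import sys

# Hypothetical budget in KiB, chosen to mirror the ulimit value in the question.
MAX_RSS_KIB = 500000


def peak_rss_kib():
    """Return the peak resident set size of this process (KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def over_budget(limit_kib=MAX_RSS_KIB):
    """True once the peak memory use has gone past the budget."""
    return peak_rss_kib() > limit_kib


def exit_if_over_budget(limit_kib=MAX_RSS_KIB):
    """Terminate the whole process so a supervising daemon can restart it."""
    if over_budget(limit_kib):
        sys.stderr.write("memory budget exceeded, exiting\n")
        # os._exit ends the process immediately, without running cleanup
        # handlers and without raising an exception inside a greenlet.
        os._exit(1)
```

Calling exit_if_over_budget() after each lxml.html.fromstring call would be one place to hook this in. Because ru_maxrss is a peak value, the check never resets, which is acceptable here since the process is meant to be replaced anyway.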

Recommended answer

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element, so keeping a reference to one keeps the whole tree in memory. See the docs on xpath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
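
A minimal sketch of the difference, assuming a small made-up document in htmltext: the default XPath string result carries a getparent() method that points back into the tree, while a result obtained with smart_strings=False is a plain string with no such link.

```python
import lxml.html
from lxml import etree

# Made-up input, standing in for a downloaded page.
htmltext = "<html><body><a href='/x'>link text</a></body></html>"

doc = lxml.html.fromstring(htmltext)

# Default behaviour: a "smart" string that knows its parent element,
# so holding on to it keeps the whole tree referenced.
smart = doc.xpath("//a/text()")[0]

# With smart_strings=False the result is a plain string
# that holds no reference back to the tree.
find_text = etree.XPath("//a/text()", smart_strings=False)
plain = find_text(doc)[0]

print(hasattr(smart, "getparent"))
print(hasattr(plain, "getparent"))
```

If the spider stores extracted strings in a long-lived list or queue, using plain strings like this lets each parsed tree be freed as soon as the parsing function returns.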
