lxml parser eats all memory


Problem description

I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all of the server's memory), even though I only have 100 async greenlets and each of them parses documents of at most 300 KB.
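
For reference, a minimal sketch of the kind of setup described above (the fetch() helper, the URL list and the pool size are illustrative assumptions, not the original code):

import gevent
from gevent.pool import Pool
import lxml.html

def fetch(url):
    # Placeholder for the real HTTP fetch; in the real spider this returns
    # an HTML document of up to ~300 KB.
    return "<html><body>%s</body></html>" % url

def parse(url):
    htmltext = fetch(url)
    doc = lxml.html.fromstring(htmltext)   # the call that appears to leak
    return doc.text_content()

pool = Pool(100)                            # 100 concurrent greenlets
urls = ["http://example.com/%d" % i for i in range(1000)]
results = pool.map(parse, urls)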

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it in isolation.

The problem is with this line:

HTML = lxml.html.fromstring(htmltext)

Does anyone know what the cause might be, or how to fix this?

Thanks for your help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
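
For illustration, a minimal sketch of applying the same soft virtual-memory limit from inside the worker process with the standard resource module (an assumption about how the limit might be applied programmatically, not code from the original post):

import resource

# "ulimit -Sv 500000" caps virtual memory at 500000 KiB; RLIMIT_AS takes bytes.
soft_limit_bytes = 500000 * 1024
_, hard_limit = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (soft_limit_bytes, hard_limit))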

Now, after some time, they start writing to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"

I can't catch this exception, so it keeps writing this message to the log recursively until there is no free space left on disk.

How can I catch this exception so I can kill the process and let the daemon create a new one?

Recommended answer

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings that provide access to their containing element, and thus keep the whole tree in memory if you hold a reference to them. See the docs on XPath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.
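
For example, a small sketch of the difference (the HTML snippet and variable names are illustrative; the smart_strings keyword is documented for etree.XPath):

import lxml.html
from lxml import etree

htmltext = "<html><body><a href='/next'>next</a></body></html>"  # illustrative input
root = lxml.html.fromstring(htmltext)

# Default behaviour: the result is a "smart" string that knows its parent
# element, so keeping it around keeps the whole parsed tree in memory.
smart = root.xpath("//a/@href")[0]
print(smart.getparent().tag)        # 'a' -- the tree is still reachable

# With smart_strings=False the results are plain strings, so the tree can be
# garbage-collected as soon as nothing else references its elements.
find_hrefs = etree.XPath("//a/@href", smart_strings=False)
plain = find_hrefs(root)[0]
print(hasattr(plain, "getparent"))  # False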

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))

