lxml parser eats all memory

Views: 91 | Posted: 2020/5/4 8:22:06 | Tags: python, memory-leaks, lxml


Problem description

I'm writing a spider in Python that uses the lxml library to parse HTML and the gevent library for asynchronous I/O. After running for some time, the lxml parser starts consuming up to 8 GB of memory (all of the server's memory). However, I have only 100 async threads, and each of them parses documents of at most 300 KB.

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it.

The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what this could be, or how to fix it?

Thanks for any help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
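For reference, the same soft virtual-memory limit can be applied from inside the Python process instead of the shell. The following is only a minimal sketch using the standard `resource` module (Unix only); the helper names `kib_to_bytes` and `set_soft_address_space_limit` are hypothetical and not part of the original setup. It assumes that `ulimit -v` counts in KiB, while `resource.setrlimit` expects bytes for `RLIMIT_AS`. As far as I know, the resident-set limit set by `ulimit -m` (`RLIMIT_RSS`) is not enforced by modern Linux kernels, so only the address-space limit is shown.

```python
import resource


def kib_to_bytes(kib):
    """Convert a ulimit-style value in KiB to bytes."""
    return kib * 1024


def set_soft_address_space_limit(kib):
    """Set the soft RLIMIT_AS, roughly what `ulimit -Sv <kib>` does.

    The hard limit is left unchanged; the soft limit is capped at the
    hard limit because the OS rejects a soft limit above it.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = kib_to_bytes(kib)
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return new_soft


# Hypothetical usage at worker start-up, mirroring `ulimit -Sv 500000`:
# set_soft_address_space_limit(500000)
```

Once the limit is reached, allocations inside the process fail, which is what surfaces as MemoryError.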

Now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

I can't catch this exception, so the message is written to the log over and over until there is no free space left on the disk.

How can I catch this exception and kill the process, so that the daemon can create a new one?
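An "Exception ... ignored" message is generally printed when an exception is raised somewhere Python cannot propagate it to the caller, which would explain why a try/except around the parse call never sees it. One possible workaround, sketched here under that assumption, is not to catch the exception at all but to check the process's own memory use between documents and exit once it crosses a budget, so that the supervising daemon starts a fresh worker. The names `MAX_RSS_KIB`, `peak_rss_kib`, `over_budget`, and `exit_if_over_budget` are hypothetical, and the threshold is only an example value.

```python
import os
import resource
import sys

# Hypothetical budget; pick a value safely below the ulimit in use.
MAX_RSS_KIB = 400000


def peak_rss_kib():
    """Return this process's peak resident set size in KiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        peak //= 1024
    return peak


def over_budget(limit_kib=MAX_RSS_KIB):
    """True once the process has ever used more than limit_kib of RSS."""
    return peak_rss_kib() > limit_kib


def exit_if_over_budget(limit_kib=MAX_RSS_KIB, exit_code=1):
    """Terminate immediately so a supervisor can start a new worker."""
    if over_budget(limit_kib):
        # os._exit skips cleanup handlers and unflushed buffers; it is
        # used here so that no other greenlet can intercept the exit.
        os._exit(exit_code)


# Hypothetical usage inside the worker loop, after each parsed document:
# exit_if_over_budget()
```

Because `ru_maxrss` is a high-water mark, the check keeps returning True after the first time the budget is exceeded, which is acceptable when the only reaction is to restart the process.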
