Python high memory usage with BeautifulSoup


Problem Description


I was trying to process several web pages with BeautifulSoup4 in Python 2.7.3, but after every parse the memory usage goes up and up.


This simplified code produces the same behavior:

from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    f.close()

while True:
    parse()
    raw_input()


After calling parse() five times, the Python process already uses 30 MB of memory (the HTML file used was around 100 kB), and it goes up by 4 MB with every call. Is there a way to free that memory, or some kind of workaround?
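One way such growth can arise (a minimal sketch; the `Node` class below is a hypothetical stand-in for a soup tree, and BeautifulSoup itself is not used): nodes that hold parent back-references form reference cycles, which reference counting alone never frees, so dead trees linger until the cyclic garbage collector runs. Note that `tracemalloc` only exists from Python 3.4 on, not in the 2.7 used in the question, but the mechanism it demonstrates is the same:

```python
import gc
import tracemalloc

class Node(object):
    """Toy node whose parent back-reference creates a cycle per tree."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def parse():
    root = Node()
    for _ in range(1000):
        Node(root)
    # root goes out of scope here; the cycles keep the whole tree alive

gc.disable()                   # mimic long gaps between automatic passes
tracemalloc.start()
for _ in range(5):
    parse()
before, _ = tracemalloc.get_traced_memory()
gc.collect()                   # one full pass frees the cyclic trees
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
gc.enable()
print(before > after)          # True: memory comes back only after collect
```

Each call leaves another unreachable-but-cyclic tree behind, which matches the steady per-call growth described above.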


Update: This behavior gives me headaches. This code easily uses up plenty of memory even though the BeautifulSoup variable should be long gone:

from bs4 import BeautifulSoup
import threading, httplib, gc

class pageThread(threading.Thread):
    def run(self):
        con = httplib.HTTPConnection("stackoverflow.com")
        con.request("GET", "/")
        res = con.getresponse()
        if res.status == 200:
            page = BeautifulSoup(res.read(), "lxml")
        con.close()

def load():
    t = list()
    for i in range(5):
        t.append(pageThread())
        t[i].start()
    for thread in t:
        thread.join()

while not raw_input("load? "):
    gc.collect()
    load()


Could that be some kind of bug?
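One way to tell a leak from mere delayed collection (a sketch using a hypothetical `FakeSoup` stand-in, not the real library): hold a weak reference to the parsed object and see whether an explicit `gc.collect()` kills it. If it does, the memory is reclaimable and the growth is just cycles waiting for the collector:

```python
import gc
import weakref

class FakeSoup(object):
    """Stand-in for a parsed tree; real soup/lxml trees hold reference cycles."""
    def __init__(self):
        self.child = {"parent": self}  # the dict points back at its owner

def parse_page():
    page = FakeSoup()
    return weakref.ref(page)           # only a weak reference escapes

gc.disable()                           # rule out an accidental automatic pass
ref = parse_page()
alive_before = ref() is not None       # cycle keeps the dead tree in memory
gc.collect()                           # one full pass reclaims the cycle
alive_after = ref() is not None
gc.enable()
print(alive_before, alive_after)       # True False
```

If the weak reference dies after `gc.collect()`, the objects were collectable all along, which points to ordinary cyclic garbage rather than a bug.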

Answer


Try Beautiful Soup's decompose() method, which destroys the tree, when you're done working with each file:

from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    # page extraction goes here
    page.decompose()
    f.close()

while True:
    parse()
    raw_input()
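To illustrate why decompose() helps (a toy sketch with a hypothetical `Tag` class, not bs4's real implementation): breaking the parent/child links removes the cycles, so on CPython plain reference counting frees the tree the moment the last name is dropped, with no collector pass needed:

```python
import gc
import weakref

class Tag(object):
    """Toy tag with parent/child links, mimicking the cycles in a soup tree."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        """Break every link, as bs4's Tag.decompose() does for its tree."""
        for child in self.children:
            child.decompose()
        self.children = []
        self.parent = None

gc.disable()                 # rely on reference counting alone
root = Tag()
leaf = Tag(root)
ref = weakref.ref(leaf)
root.decompose()             # without this, the root<->leaf cycle would leak
del root, leaf
print(ref() is None)         # True: freed immediately, no collector pass needed
gc.enable()
```

With the links severed, nothing survives until the next garbage-collection pass, which is why memory stays flat across repeated parse() calls.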
