Dryscrape/webkit_server memory leak


Problem description





I'm using dryscrape/webkit_server for scraping JavaScript-enabled websites.

The memory usage of the webkit_server process seems to increase with each call to session.visit(). I can reproduce it with the following script:

import dryscrape

# 'urls' is a list of ~300 page URLs; note that a fresh Session is
# created for every URL, yet webkit_server memory still grows
for url in urls:
    session = dryscrape.Session()
    session.set_timeout(10)
    session.set_attribute('auto_load_images', False)
    session.visit(url)
    response = session.body()
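To confirm that it really is the webkit_server child process that grows (and not the Python interpreter), one can log its resident set size after each visit. A minimal sketch, assuming psutil is installed, urls is defined as above, and webkit_server shows up as a child of the current process; it reuses a single Session so there is only one webkit_server to watch:

import dryscrape
import psutil  # assumption: pip install psutil

session = dryscrape.Session()
session.set_timeout(10)
session.set_attribute('auto_load_images', False)

# dryscrape spawns webkit_server as a child of this Python process
webkit = next(p for p in psutil.Process().children(recursive=True)
              if 'webkit_server' in p.name())

for i, url in enumerate(urls):
    session.visit(url)
    response = session.body()
    rss_mb = webkit.memory_info().rss / (1024.0 * 1024.0)
    print('%3d  %7.1f MB  %s' % (i, rss_mb, url))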

I'm iterating over approximately 300 URLs, and after 70-80 of them webkit_server takes up about 3 GB of memory. The memory itself is not really the problem for me, though; the bigger issue is that dryscrape/webkit_server gets slower with each iteration. After those 70-80 iterations dryscrape is so slow that it raises a timeout error (with the timeout set to 10 seconds) and I have to abort the Python script. Restarting webkit_server (e.g. after every 30 iterations) might help and would free the memory, but I'm unsure whether the 'memory leaks' are really what makes dryscrape slower and slower.

Does anyone know how to restart webkit_server so I can test that?
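For what it's worth, one way to test this without relying on dryscrape internals is to scrape in batches, running each batch in a throwaway worker process and killing the webkit_server child before the worker exits. This is only a sketch of the idea, assuming psutil is available and that webkit_server appears as a child of the worker; the batch size of 30 is arbitrary:

import dryscrape
import psutil
from multiprocessing import Process

def scrape_batch(batch):
    # runs in its own process, so all webkit state is thrown away with it
    session = dryscrape.Session()
    session.set_timeout(10)
    session.set_attribute('auto_load_images', False)
    for url in batch:
        session.visit(url)
        response = session.body()
        # ... process response here ...
    # kill the webkit_server child explicitly before the worker exits
    for child in psutil.Process().children(recursive=True):
        if 'webkit_server' in child.name():
            child.kill()

BATCH = 30  # restart interval; tune as needed
for i in range(0, len(urls), BATCH):
    worker = Process(target=scrape_batch, args=(urls[i:i + BATCH],))
    worker.start()
    worker.join()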

I have not found an acceptable workaround for this issue, but I also don't want to switch to another solution (selenium/phantomjs, ghost.py), as I simply love dryscrape for its simplicity. By the way, dryscrape works great as long as one is not iterating over too many URLs in one session.

This issue is also discussed here:

https://github.com/niklasb/dryscrape/issues/41

and here:

Webkit_server (called from python's dryscrape) uses more and more memory with each page visited. How do I reduce the memory used?

Solution

I had the same problem with the memory leak. I solved it by resetting the session after every page view!

A simplified workflow would look like this.

Setting up the server:

dryscrape.start_xvfb()  # start a virtual X display (for headless machines)
sess = dryscrape.Session()

Then iterate through the URLs and reset the session after every visit:

for url in urls:
    sess.set_header('user-agent', 'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36')
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    sess.visit(url)
    response = sess.body()
    sess.reset()  # clear the session state so webkit_server can free memory
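One caveat: if visit() raises the timeout error described in the question, the sess.reset() line is never reached, so stale state can survive into the next iteration. A small variation under that assumption, continuing with the sess from above and wrapping the visit in try/finally:

for url in urls:
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    try:
        sess.visit(url)
        response = sess.body()
    except Exception as exc:  # e.g. a timeout on a slow page
        print('skipping %s: %s' % (url, exc))
    finally:
        sess.reset()  # reset even when visit() failed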

Update

I still encountered the memory-leak problem, and the better answer is the one provided by @nico.

I ended up abandoning dryscrape altogether and have been using Selenium and PhantomJS instead. There are still memory leaks, but they are manageable.
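For completeness, a minimal sketch of the equivalent loop with Selenium and PhantomJS, assuming the phantomjs binary is on the PATH (webdriver.PhantomJS was later deprecated in Selenium, but it matches the setup described here); recycling the driver every so often keeps those leaks in check:

from selenium import webdriver

RESTART_EVERY = 50  # arbitrary recycling interval

def new_driver():
    driver = webdriver.PhantomJS()
    driver.set_page_load_timeout(30)
    return driver

driver = new_driver()
for i, url in enumerate(urls):
    if i and i % RESTART_EVERY == 0:
        driver.quit()  # PhantomJS leaks too, so recycle it
        driver = new_driver()
    driver.get(url)
    response = driver.page_source
driver.quit()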
