"Failed to decode response from marionette" message in Python/Firefox headless scraping script


Problem Description

Good day. I've done a number of searches here and on Google and have yet to find a solution that addresses this problem.

The scenario is:

I have a Python script (2.7) that loops through a number of URLs (think Amazon pages, scraping reviews). Each page has the same HTML layout; I'm just scraping different information from each. I use Selenium with a headless browser because these pages have JavaScript that needs to execute before the information can be grabbed.

I run this script on my local machine (OS X 10.10). Firefox is the latest, v59. Selenium is at version 3.11.0, using geckodriver v0.20.

Locally this script has no issues; it runs through all the URLs and scrapes the pages without a problem.

Now when I put the script on my server, the only difference is that it runs Ubuntu 16.04 (32-bit). I use the appropriate geckodriver (still v0.20), but everything else is the same (Python 2.7, Selenium 3.11). The headless browser appears to crash at random, and after that no browserObj.get('url...') call works anymore.

The error message says:

Message: failed to decode response from marionette

Any further Selenium requests for pages return the error:

Message: tried to run command without establishing a connection


To show some code:

Where I create the driver:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.set_headless(headless=True)  # run Firefox with no visible window

    driver = webdriver.Firefox(
        firefox_options=options,
        executable_path=config.GECKODRIVER  # path to the geckodriver binary
    )

driver is passed to the script's function as the parameter browserObj, which is then used to request specific pages; once a page loads, its source is handed to BeautifulSoup for parsing:

from bs4 import BeautifulSoup

browserObj.get(url)

soup = BeautifulSoup(browserObj.page_source, 'lxml')
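
Once soup exists, the parsing itself is ordinary BeautifulSoup work; for illustration only (the div.review selector is a hypothetical placeholder, not from the actual script):

# Hypothetical example: collect the text of each review block.
for review in soup.select('div.review'):
    print(review.get_text(strip=True))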


The error seems to point to the BeautifulSoup line as what is crashing the browser.

What is likely causing this, and what can I do to resolve the issue?

Adding a stack trace, which points to the same thing:

Traceback (most recent call last):
  File "main.py", line 164, in <module>
    getLeague
  File "/home/ps/dataparsing/XXX/yyy.py", line 48, in BBB
    soup = BeautifulSoup(browserObj.page_source, 'lxml')
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 670, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
WebDriverException: Message: Failed to decode response from marionette


Note: This script used to work with Chrome. Because the server is 32-bit, I can only use chromedriver v2.33, which only supports Chrome v60-62. Chrome is currently at v65, and on DigitalOcean I don't seem to have an easy way to revert to an older version, which is why I'm stuck with Firefox.

Recommended Answer

I still don't know why this is happening, but I may have found a workaround. I read in some documentation that there may be a race condition (on what, I'm not sure, since there shouldn't be two items competing for the same resources).

I changed the scraping code to do this:

import time

from bs4 import BeautifulSoup

browserObj.get(url)

# Give the page's JavaScript time to finish before reading the source.
time.sleep(3)

soup = BeautifulSoup(browserObj.page_source, 'lxml')

There's no specific reason I chose 3 seconds, but since adding this delay I have not seen the Message: failed to decode response from marionette error on any of the URLs in my scrape list.
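
A fixed sleep works, but an explicit wait is usually a more targeted option: block until an element the scraper actually needs is present, rather than waiting a flat 3 seconds. A minimal sketch using Selenium's WebDriverWait; the div.review selector is a hypothetical placeholder:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browserObj.get(url)

# Wait up to 10 seconds for the target element to appear before
# reading the page source; adjust the selector to the real markup.
WebDriverWait(browserObj, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.review'))
)

soup = BeautifulSoup(browserObj.page_source, 'lxml')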

Update: October 2018

This continued to be an issue more than six months later. Firefox, geckodriver, Selenium, and PyVirtualDisplay had all been updated to their latest versions. The error kept recurring spontaneously and without pattern: sometimes the script worked, sometimes it didn't.
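
While it persisted, one defensive pattern that can keep a long scrape alive is to treat the exception as fatal for that browser instance and start a fresh one. This is only a sketch, not code from the original script; create_driver and urls are hypothetical stand-ins for the script's own setup and URL list:

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.firefox.options import Options

def create_driver():
    # Same setup as shown in the question; config.GECKODRIVER is the driver path.
    options = Options()
    options.set_headless(headless=True)
    return webdriver.Firefox(firefox_options=options,
                             executable_path=config.GECKODRIVER)

driver = create_driver()
for url in urls:  # 'urls' stands in for the real list of pages
    try:
        driver.get(url)
    except WebDriverException:
        # The Marionette session is dead, so every later command would fail;
        # discard the crashed browser and retry the page once with a new one.
        try:
            driver.quit()
        except WebDriverException:
            pass
        driver = create_driver()
        driver.get(url)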

What ultimately fixed the issue was increasing the RAM on the server from 1 GB to 2 GB. Since the increase, there have been no failures of this sort.
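
If anyone suspects the same cause, a quick sanity check is how much memory the box has available when the browser launches; headless Firefox crashing under memory pressure fits the random failures described above. A minimal, Linux-only sketch that reads /proc/meminfo:

def available_memory_kb():
    # MemAvailable is exposed by /proc/meminfo on Ubuntu 16.04 kernels.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    return None

print('MemAvailable: %s kB' % available_memory_kb())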
