"Failed to decode response from marionette" message in Python/Firefox headless scraping script


Problem Description

Good day. I've done a number of searches here and on Google and have yet to find a solution that addresses this problem.

The scenario is:

I have a Python script (2.7) that loops through a number of URLs (think Amazon pages, scraping reviews). Each page has the same HTML layout; only the information being scraped differs. I use Selenium with a headless browser because these pages contain JavaScript that must execute before the information can be grabbed.

I run this script on my local machine (OS X 10.10). Firefox is the latest, v59. Selenium is at version 3.11.0, using geckodriver v0.20.

Locally this script has no issues; it runs through all the URLs and scrapes the pages without a problem.

Now, when I put the script on my server, the only difference is that it runs Ubuntu 16.04 (32-bit). I use the appropriate geckodriver (still v0.20), but everything else is the same (Python 2.7, Selenium 3.11). The headless browser appears to crash at random, and after that none of the browserObjt.get('url...') calls work anymore.

The error message says:

Message: failed to decode response from marionette

Any further Selenium requests for pages return the error:

Message: tried to run command without establishing a connection
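
Once the session dies like this, the only recovery is a fresh browser instance. Below is a rough sketch (not from the original post) of how the fetch loop could tolerate such a crash; create_driver and get_with_restart are hypothetical helpers that rebuild the driver when a WebDriverException surfaces:

    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.firefox.options import Options

    def create_driver():
        # Hypothetical factory mirroring the setup shown later in the question
        options = Options()
        options.set_headless(headless=True)
        return webdriver.Firefox(firefox_options=options)

    def get_with_restart(driver, url, retries=1):
        # Try the page; if the marionette session is dead, replace the browser
        for _ in range(retries + 1):
            try:
                driver.get(url)
                return driver
            except WebDriverException:
                try:
                    driver.quit()
                except WebDriverException:
                    pass
                driver = create_driver()
        raise RuntimeError("could not fetch %s after a browser restart" % url)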

---

To show some code:

When I create the driver:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.set_headless(headless=True)  # Selenium 3.x headless flag

    driver = webdriver.Firefox(
        firefox_options=options,
        executable_path=config.GECKODRIVER  # path to geckodriver v0.20
    )

driver is passed into the script's function as the parameter browserObj, which is then used to fetch specific pages; once a page loads, its source is handed to BeautifulSoup for parsing:

    from bs4 import BeautifulSoup

    browserObj.get(url)

    soup = BeautifulSoup(browserObj.page_source, 'lxml')

---

The error might be pointing to the BeautifulSoup line as what is crashing the browser.

What is likely causing this, and what can I do to resolve the issue?

Adding a stack trace, which points to the same thing:

Traceback (most recent call last):
  File "main.py", line 164, in <module>
    getLeague
  File "/home/ps/dataparsing/XXX/yyy.py", line 48, in BBB
    soup = BeautifulSoup(browserObj.page_source, 'lxml')
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 670, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/home/ps/AAA/projenv/local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
WebDriverException: Message: Failed to decode response from marionette

---

Note: this script used to work with Chrome. Because the server is 32-bit, I can only use chromedriver v0.33, which only supports Chrome v60-62. Chrome is currently at v65, and on DigitalOcean I don't seem to have an easy way to revert to an older version, which is why I am stuck with Firefox.

Answer

I still don't know why this is happening, but I may have found a workaround. I read in some documentation that there may be a race condition (on what, I am not sure, since there shouldn't be two items competing for the same resources).

I changed the scraping code to do this:

    import time

    browserObj.get(url)

    # Give the JavaScript-heavy page a few seconds to finish rendering
    time.sleep(3)

    soup = BeautifulSoup(browserObj.page_source, 'lxml')

There is no specific reason I chose 3 seconds, but since adding this delay I have not had the Message: failed to decode response from marionette error on any of the URLs in my list.
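
If a fixed sleep ever proves too short (or wastefully long), an explicit wait is a common alternative: block until the JavaScript-rendered element you actually scrape is present. A sketch, assuming the reviews live in an element matched by the placeholder '.review' selector (not from the original post):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browserObj.get(url)

    # Wait up to 10 seconds for the rendered content instead of sleeping
    # unconditionally; '.review' is a placeholder selector
    WebDriverWait(browserObj, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.review'))
    )

    soup = BeautifulSoup(browserObj.page_source, 'lxml')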

Update: October 2018

Six months later this continues to be an issue. Firefox, geckodriver, Selenium, and PyVirtualDisplay have all been updated to their latest versions, yet the error kept recurring spontaneously and without pattern: sometimes it worked, sometimes it didn't.

What fixed the issue was increasing the RAM on the server from 1 GB to 2 GB. Since the increase there have been no failures of this sort.
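
For anyone hitting the same symptom, a quick way to confirm memory pressure is to log available memory before each page load. A minimal, Linux-only diagnostic sketch (not from the original answer) that reads MemAvailable from /proc/meminfo:

    def available_mb():
        # Parse MemAvailable (reported in kB) from /proc/meminfo; Linux only
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemAvailable:'):
                    return int(line.split()[1]) // 1024
        return None

    print('available memory: %s MB' % available_mb())
    browserObj.get(url)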

