PhantomJS返回空网页(Python,Selenium) [英] PhantomJS returning empty web page (python, Selenium)

查看:178
本文介绍了PhantomJS返回空网页(Python,Selenium)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

尝试通过屏幕抓取网站,而不必在python脚本中启动实际的浏览器实例(使用Selenium).我可以使用Chrome或Firefox来做到这一点-我已经尝试过并且可以使用-但是我想使用PhantomJS,所以它没有头.

Trying to screen scrape a web site without having to launch an actual browser instance in a python script (using Selenium). I can do this with Chrome or Firefox - I've tried it and it works - but I want to use PhantomJS so it's headless.

代码如下:

import sys
import traceback
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)

except Exception, e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)

finally:
    browser.close()

输出仅为:

<html><head></head><body></body></html>

但是当我使用Chrome或Firefox选项时,它可以正常工作.我以为该网站可能会根据用户代理返回垃圾邮件,因此我尝试伪装成该垃圾邮件.没什么.

But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that out. No difference.

我想念什么?

已更新:我将尝试将以下代码段更新为最新版本,直到它起作用为止.下面是我目前正在尝试的方法.

UPDATED: I will try to keep the below snippet updated with until it works. What's below is what I'm currently trying.

import sys
import traceback
import time
import re

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print "getting web page..."
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print "waiting %s seconds..." % timeout
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print "done waiting. Response:"

    # Rest of code snipped. Fails as "wait" above.

推荐答案

我遇到了同样的问题,没有足够的代码让驱动程序等待帮助.
问题在于https网站上的SSL加密,忽略它们将解决问题.

I was facing the same problem and no amount of code to make the driver wait was helping.
The problem is the SSL encryption on the https websites, ignoring them will do the trick.

以以下方式调用PhantomJS驱动程序:

Call the PhantomJS driver as:

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

这为我解决了问题.

这篇关于PhantomJS返回空网页(Python,Selenium)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆