Selenium with proxy returns empty website


Problem description

I am having trouble getting the page source HTML of a site with Selenium through a proxy. Here is my code:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time

import shutil

proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'

PROXY = proxy_username+":"+proxy_password+"@"+hostname+":"+port

options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)

driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)

driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
# Use a context manager so the file is flushed and closed after writing
with codecs.open('dummy.html', "w", "utf-8") as f:
    f.write(html)

driver.close()

This results in very incomplete HTML, showing only the outer head and body tags:

html
Out[3]: '<html><head></head><body></body></html>'

Also, the dummy.html file written to disk does not show any content other than what is displayed in the line above.
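Since every failing run produces the same bare skeleton, it can help to detect that case explicitly instead of inspecting the file by hand. Below is a minimal sketch using only the standard library; `is_blank_page` and `_ContentProbe` are hypothetical helper names, not part of Selenium, and "blank" is assumed to mean no elements other than html, head and body and no visible text.

```python
from html.parser import HTMLParser


class _ContentProbe(HTMLParser):
    """Counts elements other than html/head/body and notes any non-blank text."""

    def __init__(self):
        super().__init__()
        self.inner_tags = 0
        self.has_text = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag not in ("html", "head", "body"):
            self.inner_tags += 1

    def handle_data(self, data):
        if data.strip():
            self.has_text = True


def is_blank_page(html):
    """Return True if the HTML is only an empty html/head/body skeleton."""
    probe = _ContentProbe()
    probe.feed(html)
    probe.close()
    return probe.inner_tags == 0 and not probe.has_text
```

Calling `is_blank_page(driver.page_source)` right after `driver.get(...)` would let the script raise an error or retry rather than silently writing an empty file.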

I am lost. Here is what I tried:

  1. It does work when I run it without the options.add_argument('--proxy-server=%s' %PROXY) line, so I am sure it is the proxy. But the proxy connection itself seems to be OK (I do not get any proxy connection errors, and I do get the outer frame from the website, right? So the driver request gets through and back to me).
  2. Different URLs: not only whatismyip.com fails but also any other page. I tried different news outlets such as CNN, and even Google. Virtually nothing comes back from any website except the head and body tags. It cannot be a JavaScript/iframe issue, right?
  3. Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds. My connection is also very fast; less than 1 second should be enough (it is in a browser).

What am I misunderstanding about the connection?

Recommended answer

driver.page_source does not always return what you expect from Selenium. It is likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.: https://stackoverflow.com/a/45247539/1387701

Selenium makes a best effort to provide the page source as it was fetched. On highly dynamic pages, what it returns can often be limited.
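If page_source is the limiting factor, one common alternative is to ask the browser for its live DOM through JavaScript once the document reports it has finished loading. The sketch below is an illustration, not a confirmed fix for the proxy case: `rendered_html` is a hypothetical helper name, and the only thing it assumes about `driver` is Selenium's `execute_script` method.

```python
import time


def rendered_html(driver, timeout=10.0, poll=0.5):
    """Return the live DOM as HTML once document.readyState is "complete".

    `driver` is any object exposing Selenium's execute_script(script) method.
    Raises TimeoutError if the page does not finish loading within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Ask the browser itself whether the document has finished loading.
        if driver.execute_script("return document.readyState") == "complete":
            # Serialize the current DOM, including nodes added by scripts.
            return driver.execute_script(
                "return document.documentElement.outerHTML"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "page did not finish loading within %.1f s" % timeout
            )
        time.sleep(poll)
```

In the script above this would replace the fixed time.sleep(10) plus driver.page_source, e.g. html = rendered_html(driver, timeout=60). If the result is still the empty skeleton, the page was most likely never delivered at all, which points back at the proxy setup rather than at page_source.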
