Selenium with proxy returns empty website
Problem description
I am having trouble getting the page source HTML of a site with Selenium through a proxy. Here is my code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username+":"+proxy_password+"@"+hostname+":"+port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
# Use a context manager so the file is flushed and closed after writing
with codecs.open('dummy.html', "w", "utf-8") as f:
    f.write(html)
driver.close()
This results in very incomplete HTML, showing only the empty head and body tags:
html
Out[3]: '<html><head></head><body></body></html>'
Also, the dummy.html file written to disk does not show any content other than what is displayed in the line above.
I am lost. Here is what I tried:
- It does work when I run it without the options.add_argument('--proxy-server=%s' %PROXY) line. So I am sure it is the proxy. But the proxy connection itself seems to be OK (I do not get any proxy connection errors, plus I do get the outer frame from the website, right? So the driver request gets through and back to me).
- Different URLs: not only whatismyip.com fails, but any other page too. I tried different news outlets such as CNN, and even Google. Virtually nothing comes back from any website except the head and body tags. It cannot be a JavaScript/iframe issue, right?
- Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds. Plus my connection is super fast; <1 second should be enough (in a browser).
What am I misunderstanding about the connection?
Recommended answer
driver.page_source does not always return what you expect from Selenium. It is likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.: https://stackoverflow.com/a/45247539/1387701
Selenium makes a best effort to provide the page source as it was fetched. On highly dynamic pages, what it returns can often be limited.
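One way to check whether page_source is really the limiting factor is to compare it with the DOM the browser currently holds. The following is a minimal sketch, not a definitive diagnosis: the helper names is_empty_shell and snapshot_dom are hypothetical (they are not part of Selenium), and driver is assumed to be a Selenium WebDriver that has already loaded a page. Only driver.page_source and driver.execute_script are taken from Selenium itself.

```python
from html.parser import HTMLParser


class _BodyContentProbe(HTMLParser):
    """Records whether any tag or non-blank text appears inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.has_content = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.has_content = True


def is_empty_shell(html):
    """Hypothetical helper: True if the document has nothing inside <body>."""
    probe = _BodyContentProbe()
    probe.feed(html)
    probe.close()
    return not probe.has_content


def snapshot_dom(driver):
    """Hypothetical helper: return (page_source, live_dom) for comparison.

    Assumes `driver` is a Selenium WebDriver with a page already loaded.
    """
    page_source = driver.page_source
    # Ask the browser to serialize the DOM as it exists right now.
    live_dom = driver.execute_script(
        "return document.documentElement.outerHTML"
    )
    return page_source, live_dom
```

With a real driver, this could be called as page_source, live_dom = snapshot_dom(driver) after driver.get(...). If is_empty_shell(page_source) is true but is_empty_shell(live_dom) is false, the limitation described above is the likely cause. If both are empty shells, the document inside the browser is itself empty, and the cause probably lies in how the page was loaded rather than in page_source.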