Why do Python and my web browser show different code for the same link?


Problem description

Let's use the URL https://www.google.cl/#q=stackoverflow as an example. Using Chrome Developer Tools on the first link given by the search, we see this HTML code:

Now, if I run this code:

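from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page the way a plain HTTP client does; no JavaScript is executed.
response = urlopen("https://www.google.cl/#q=stackoverflow")
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response, "html.parser")
print(soup.prettify())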

I won't find the same elements. In fact, I won't find any of the links from the results given by the Google search. The same goes if I use the requests module. Why does this happen? Can I do something to get the same results as if I were requesting the page from a web browser?

Solution

The HTML is generated dynamically, likely by a modern single-page JavaScript framework such as Angular or React (or even just plain JavaScript). urlopen and requests only download the initial document and never run those scripts, so you will need to actually drive a browser to the site using Selenium or PhantomJS before parsing the DOM.
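Part of the difference is already visible in the URL from the question. Everything after `#` is a fragment, which stays on the client and is not sent to the server, so a plain HTTP client effectively asks for `https://www.google.cl/` with no search term at all. That suggests the results are filled in by JavaScript after the page loads. A small sketch using only the standard library (the variable names are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://www.google.cl/#q=stackoverflow"
parts = urlsplit(url)

# The search term lives in the fragment, not in the query string.
print(repr(parts.query))     # query string is empty
print(repr(parts.fragment))  # fragment holds "q=stackoverflow"

# What is actually requested from the server: the URL minus its fragment.
requested = urlunsplit(parts._replace(fragment=""))
print(requested)
```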

Here is some skeleton code.

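from selenium import webdriver
from bs4 import BeautifulSoup

# Start a real browser so the page's JavaScript actually runs.
driver = webdriver.Chrome()
driver.get("http://google.com")

# Serialize the DOM as it exists after the scripts have run.
html = driver.execute_script("return document.documentElement.innerHTML")
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")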

Here is the Selenium documentation for more info on running Selenium, configuration, etc.:

http://selenium-python.readthedocs.io/

Edit: you will likely need to add a wait before grabbing the HTML, since certain elements of the page may take a second or so to load. See the explicit wait documentation for Python Selenium:

http://selenium-python.readthedocs.io/waits.html
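To make the idea of a wait concrete, here is a minimal hand-rolled sketch that polls until at least one matching element exists or a timeout passes. The helper name `wait_for_elements` and its parameters are hypothetical and not part of Selenium; it only assumes that `driver` has Selenium's `find_elements(by, value)` method. In real code, the `WebDriverWait(driver, timeout).until(...)` pattern described in the documentation above is probably the better choice.

```python
import time


def wait_for_elements(driver, by, value, timeout=10.0, poll=0.5):
    """Poll driver.find_elements(by, value) until it returns a non-empty list.

    Hypothetical helper, not part of Selenium. `driver` is expected to be a
    Selenium WebDriver; (by, value) is a locator such as (By.CSS_SELECTOR, "a").
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(by, value)  # empty list while nothing matches
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {value!r} within {timeout}s")
        time.sleep(poll)  # give the page's scripts time to add the elements
```

Once the helper returns, the elements are present in the DOM and the HTML can be grabbed and parsed as in the skeleton code.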

Another complication is that certain parts of the page might stay hidden until after user interaction. In this case you will need to script Selenium to interact with the page in the required way before grabbing the HTML.
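As a rough sketch of that last point, the helper below clicks one element and then reads the page HTML. The name `click_and_get_html` is hypothetical; it relies only on Selenium's `find_element(by, value)`, `click()`, and `page_source`, and which element needs clicking is entirely an assumption that depends on the site.

```python
def click_and_get_html(driver, by, value):
    """Click the first element matching the locator, then return the page HTML.

    Hypothetical helper, not part of Selenium. `driver` is assumed to be a
    Selenium WebDriver and (by, value) a locator such as (By.ID, "show-more").
    """
    driver.find_element(by, value).click()  # trigger the content that was hidden
    return driver.page_source               # page HTML as the browser now reports it
```

If the click starts an asynchronous load, the new content may not be there immediately, so a wait between the click and reading the HTML may still be needed.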
