Python requests isn't giving me the same HTML as my browser is
Question
I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very same page.
For comparison, here's the page Firefox gets me, and here's the page requests fetches: https://www.dropbox.com/s/gwnqtmrkr5zxgmn/yokai-pythonrequests.html?dl=0 (download them to view; sorry, there's no easy way to just visually host a bit of HTML from another site).
You'll note a few differences (super unfriendly diff). There are some small things, like attributes being ordered differently and such, but there are also a few very, very large things. Most important is the lack of the last six <img>s, and the entirety of the navigation and footer sections. Even in the raw HTML it looks like the page cut off abruptly.
Why is this happening, and is there a way to fix it? I've thought of a bunch of things already, none of which have been fruitful:
- Request headers interfering? Nope, I tried copying the headers my browser sends, User-Agent and all, 1:1 into the requests request, but nothing changed.
- JavaScript loading content after the HTML is loaded? Nah. Even with JS disabled, Firefox gives me the "good" page.
- Uh... well... what else could there be?
It'd be amazing if you know a way this could happen and a way to fix it. Thank you!
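The differences described above (missing <img> elements, a page that seems to cut off abruptly) can be checked without reading a raw diff. The following is a minimal sketch using only the standard library; the helper name summarize and the sample strings are illustrative assumptions, not part of the original post.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every start tag seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def summarize(html):
    """Hypothetical helper: report the <img> count and whether the
    document ends with a closing </html> tag."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "img": parser.counts["img"],
        "ends_with_html_close": html.rstrip().lower().endswith("</html>"),
    }
```

Running summarize on the HTML saved from the browser and on the HTML saved from requests should make a truncated response stand out: fewer img tags and ends_with_html_close set to False.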
Recommended answer
I ran into a similar problem:
- The same headers were used from Python and from the browser
- JavaScript was definitely ruled out as the cause
To resolve the issue, I ended up swapping out the requests library for urllib.request.
Basically, I replaced:
import requests
session = requests.Session()
r = session.get(URL)
with:
import urllib.request
r = urllib.request.urlopen(URL)
and then it worked.
Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
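To see whether the two libraries really return different bodies for the same URL, one option is to fetch with both and compare the raw bytes. This is only a sketch under stated assumptions: the function names, the placeholder User-Agent value, and the 30-second timeout are all illustrative choices, and requests is a third-party package that is imported lazily so the comparison helper works without it.

```python
import urllib.request


def fetch_with_urllib(url, user_agent="Mozilla/5.0"):
    """Fetch the body with the standard library (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def fetch_with_requests(url, user_agent="Mozilla/5.0"):
    """Fetch the body with requests (hypothetical helper)."""
    import requests  # third-party; only needed when this function is called

    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    return resp.content


def common_prefix_len(a, b):
    """Number of leading bytes that a and b share."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def compare_bodies(a, b):
    """Summarize how two response bodies differ."""
    return {
        "len_a": len(a),
        "len_b": len(b),
        "common_prefix": common_prefix_len(a, b),
        "a_complete": a.rstrip().lower().endswith(b"</html>"),
        "b_complete": b.rstrip().lower().endswith(b"</html>"),
    }
```

Calling compare_bodies(fetch_with_requests(URL), fetch_with_urllib(URL)) shows whether one body is simply a shorter prefix of the other (common_prefix equal to the shorter length), which would point to truncation rather than different content being served.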