Python requests isn't giving me the same HTML as my browser is

Question

I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very same page.

For comparison, here's the page Firefox gets me, and here's the page requests fetches: https://www.dropbox.com/s/gwnqtmrkr5zxgmn/yokai-pythonrequests.html?dl=0 (download them to view; sorry, there's no easy way to just visually host a bit of HTML from another site).

You'll note a few differences (super unfriendly diff). There are some small things, like attributes being ordered differently and such, but there are also a few very, very large things. Most important is the lack of the last six <img>s, and the entirety of the navigation and footer sections. Even in the raw HTML it looks like the page cut off abruptly.
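One way to make that comparison less painful is to diff the two saved pages in Python and check whether the fetched copy ever reaches its closing tag. The sketch below is only an illustration: the helper names `looks_truncated` and `changed_lines` are hypothetical, and the "ends with </html>" check is a rough heuristic rather than a real HTML validity test.

```python
import difflib


def looks_truncated(html):
    """Heuristic: a complete HTML document normally ends with </html>."""
    return not html.rstrip().lower().endswith("</html>")


def changed_lines(browser_html, fetched_html):
    """Return only the added (+) and removed (-) lines between two pages."""
    diff = list(
        difflib.unified_diff(
            browser_html.splitlines(),
            fetched_html.splitlines(),
            fromfile="browser",
            tofile="requests",
            lineterm="",
            n=0,  # no context lines, so only real changes remain
        )
    )
    # The first two entries are the ---/+++ file headers; "@@" lines are hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


if __name__ == "__main__":
    # Tiny stand-ins for the two saved pages (assumed content, not the real pages).
    browser_page = "<html>\n<body>\n<img src='a.png'>\n<footer>x</footer>\n</body>\n</html>"
    fetched_page = "<html>\n<body>\n<img src='a.png'>"
    print("fetched copy truncated:", looks_truncated(fetched_page))
    for line in changed_lines(browser_page, fetched_page):
        print(line)
```

Lines prefixed with `-` exist only in the browser's copy, so a long run of them at the end of the output would match the "cut off abruptly" symptom described above.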

Why is this happening, and is there a way to fix it? I've thought of a bunch of things already, none of which have been fruitful:


  • Request headers interfering? Nope, I tried copying the headers my browser sends, User-Agent and all, 1:1 into the requests request, but nothing changed.
  • JavaScript loading content after the HTML is loaded? Nah. Even with JS disabled, Firefox gives me the "good" page.
  • Uh... well... what else could there be?

It'd be amazing if you know a way this could happen and a way to fix it. Thank you!

Recommended answer

I was having a similar issue:

  • Identical headers with Python and through the browser
  • JavaScript definitely ruled out as the cause

To resolve the issue, I ended up swapping out the requests library for urllib.request.

Basically, I replaced:

import requests

session = requests.Session()
r = session.get(URL)

with:

import urllib.request

r = urllib.request.urlopen(URL)

and it worked.

Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
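For anyone trying the same swap, here is a slightly fuller sketch of the urllib.request side that also decodes the response body into text. The helper name `fetch_html` and its `user_agent` and `timeout` parameters are assumptions made for illustration, not part of the original fix, and nothing here explains why the two libraries got different pages. One difference that may be worth checking is the default headers each library sends (User-Agent and Accept-Encoding in particular), but that is a guess.

```python
import urllib.request


def fetch_html(url, user_agent=None, timeout=10):
    """Fetch a URL with urllib.request and return the body as text.

    user_agent is optional; when omitted, urllib sends its own default.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in Content-Type, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps this demo offline; replace it with the real page URL.
    print(fetch_html("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E"))
```

Saving the returned text to a file and comparing it with the browser's copy is a quick way to tell whether the swap actually returns the full page for a given site.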
