python requests.get always gets 404

Problem Description

I would like to try sending requests.get to this website:

import requests

requests.get('https://rent.591.com.tw')

and I always get

<Response [404]>

I know this is a common problem and have tried different approaches, but it still fails, even though all other websites work fine.

Any suggestions?

Solution

Web servers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out how the request that Python sends differs from the request the other client sends.

That means you need to:

  • Record all aspects of the working request
  • Record all aspects of the failing request
  • Try out what changes you can make to make the failing request more like the working request, and minimise those changes.

I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
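
For example, the /headers endpoint on httpbin.org echoes back the request headers it received, which makes the defaults that requests sends easy to see; a minimal sketch:

import requests

# httpbin returns the received headers as JSON, revealing defaults such as
# the python-requests User-Agent string.
resp = requests.get('https://httpbin.org/headers')
print(resp.json())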

For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:

  • Host; this must be set to the hostname you are contacting, so that the server can route the request to the correct site when it hosts several. requests sets this one.
  • Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
  • Connection: leave this to the client to manage
  • Cookies: these are often set on an initial GET request, or after first logging in to the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser does); see the sketch just after this list.
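
By way of illustration, here is a minimal sketch of capturing cookies with a requests.Session(); the login URL and form field names are hypothetical placeholders, not this site's real ones:

import requests

session = requests.Session()
# Cookies set by the login response are stored on the session automatically...
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# ...and are sent along with every later request made through the same session.
resp = session.get('https://example.com/members-only')
print(resp.status_code)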

Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.

In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting it to almost any other value already works:

>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
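
As a follow-up, if you are going to make more than one request, it is usually cleaner to set the replacement User-Agent once on a session; a small sketch (the value 'Custom' is as arbitrary as above):

import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers['User-Agent'] = 'Custom'
resp = session.get('https://rent.591.com.tw')
print(resp.status_code)  # 200 once the default python-requests agent is replaced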

Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
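
As a rough sketch of that last option, assuming requests-html is installed (on first use, render() downloads a headless Chromium build):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw')
# Execute the page's JavaScript in headless Chromium, then query the rendered DOM.
r.html.render()
print(r.html.find('title', first=True).text)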

The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
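
Here is a hedged sketch of requesting that endpoint directly; whether it answers with JSON, and which extra headers or cookies it expects, would need to be verified against the live site:

import requests

resp = requests.get(
    'https://rent.591.com.tw/home/search/rsList',
    params={'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1},
    headers={'User-Agent': 'Custom'},  # the site 404s on the default Python agent
)
print(resp.status_code, resp.headers.get('Content-Type'))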

Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and to handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
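
For illustration, a minimal sketch of that GET-then-POST pattern; the URLs, the csrf_token field name, and the regular expression are hypothetical and would differ per site:

import re
import requests

session = requests.Session()  # keeps cookies between the two requests
page = session.get('https://example.com/form')
# Pull the anti-CSRF token out of the form markup; a real site may need an HTML parser.
match = re.search(r'name="csrf_token" value="([^"]+)"', page.text)
token = match.group(1) if match else ''
resp = session.post('https://example.com/submit',
                    data={'csrf_token': token, 'field': 'value'})
print(resp.status_code)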

Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they'd rather have you use. Check for both, and take into consideration that you may be blocked more effectively if you continue to scrape the site anyway.
