Get final URL after timed delay or redirect

Question

I am trying to scrape a website, but the page has a 5-second redirect delay: you have to wait 5 seconds before the real page loads. I tried the following code:

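from bs4 import BeautifulSoup
import time
import requests

url = "https://etherscan.io/address/0xc257274276a4e539741ca11b590b9447b26a8051"

# timeout=6 only limits how long requests waits for the server to respond
r = requests.get(url, timeout=6)
# This pause runs after the response has already been received
time.sleep(5)
# Lists the HTTP-level redirects that requests followed
print(r.history)

data = r.text

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data, "html.parser")

print(soup.prettify())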

But when I run the code, I get the redirect page rather than the final page. Thanks for your help.

Recommended answer

It looks like etherscan.io is protected by Cloudflare, and Cloudflare is causing the delayed redirect you are seeing. One of Cloudflare's purposes is to prevent bots from making automated requests to the site, which is what your script appears to be doing.
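One way to check this from a script is to test whether a response looks like a Cloudflare interstitial rather than the real page. If the delay is driven by JavaScript, as this answer suggests, it will never appear in r.history, because requests does not execute JavaScript. The sketch below is only a heuristic: the function name is hypothetical, and the status codes and marker strings are assumptions about what such a page typically contains, not a guaranteed signature.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(
    status_code: int, headers: Mapping[str, str], body: str
) -> bool:
    """Heuristically guess whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Cloudflare marks challenge responses with this header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    served_by_cloudflare = "cloudflare" in normalized.get("server", "")
    # Assumption: interstitial pages usually come back with one of these codes.
    interstitial_status = status_code in (403, 429, 503)
    # Assumption: typical wording of the waiting page; it may change over time.
    lowered_body = body.lower()
    interstitial_text = any(
        marker in lowered_body
        for marker in ("just a moment", "checking your browser")
    )
    return served_by_cloudflare and interstitial_status and interstitial_text
```

With the code from the question, this could be called as looks_like_cloudflare_challenge(r.status_code, r.headers, r.text) before parsing the HTML.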

Getting around Cloudflare will not be easy. First, you need to make your requests look like they come from a real browser: the tool you use must send the same request headers a real browser would, handle cookies like a browser would, run JavaScript like a browser would, and so on.
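The first two requirements (browser-like headers and cookie handling) can be sketched with the Python standard library alone. The header values below are illustrative assumptions, not values known to satisfy Cloudflare, and make_browser_like_opener is a hypothetical helper name. This sketch does not run JavaScript, so on its own it is unlikely to get past a JavaScript challenge; that part needs a real browser engine.

```python
import http.cookiejar
import urllib.request

# Assumed example values; copy the real ones from your own browser's dev tools.
BROWSER_LIKE_HEADERS = [
    (
        "User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.9"),
]


def make_browser_like_opener():
    """Build an opener that sends browser-like headers and keeps cookies."""
    # The cookie jar stores cookies set by the server and replays them
    # on later requests made through the same opener.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib" User-Agent with the headers above.
    opener.addheaders = list(BROWSER_LIKE_HEADERS)
    return opener, cookie_jar
```

A page would then be fetched with opener.open(url, timeout=6). With the requests library used in the question, a requests.Session plus session.headers.update(...) plays the same role.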

Even if you succeed in doing all of the above, Cloudflare is likely to block or challenge your requests after a certain number of them have been made within some period of time.
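Since the exact limits are not known, the most a client can do is space its requests out. The sketch below shows a minimal throttle that enforces a minimum interval between calls; the Throttle class and the interval are assumptions chosen for illustration, and nothing here guarantees that a given rate stays under Cloudflare's thresholds.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when this call is considered to have gone out.
        self._last_call = now + delay
        return delay
```

Calling throttle.wait() immediately before each request keeps consecutive requests at least min_interval_seconds apart.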
