How to scrape a page that first redirects to another page
Problem description
I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but when the link is loaded through the web driver it is automatically redirected to a login page. After I log in, the browser goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the login page.
How do I make Beautiful Soup scrape the page I want instead of the login page?
I have already tried putting a time.sleep() before the scrape to give me time to log in, but that didn't work either.
import time
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)
Recommended answer
What you probably need is a persistent session with requests. This answer probably covers exactly what you need. The general idea is simple:
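- Open a session and send a request to the website.
- Send the login request so that the session is logged in.
- Query the URL using the same session.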
You will need to understand how the login POST request is structured and what data is passed (username, email, etc.), and build a payload (a dict, as in the code below) with that data.
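One field in that payload, `csrfmiddlewaretoken`, usually cannot be hard-coded. The name suggests a Django-style CSRF token, which is normally rendered as a hidden input in the login form; that is an assumption about this site, not something confirmed here. Under that assumption, a minimal sketch for pulling the token out of the login page HTML could look like this. It uses only the standard library, and the helper name `extract_hidden_field` is hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class _HiddenInputFinder(HTMLParser):
    """Records the value of the first <input> whose name attribute matches."""

    def __init__(self, field_name: str) -> None:
        super().__init__()
        self.field_name = field_name
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except <input> tags, and stop after the first hit.
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.value = attributes.get("value")


def extract_hidden_field(html: str, field_name: str = "csrfmiddlewaretoken") -> Optional[str]:
    """Return the value of the named input in `html`, or None if it is absent."""
    finder = _HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


# Made-up login form used only to exercise the helper.
sample_html = (
    '<form method="post">'
    '<input type="hidden" name="csrfmiddlewaretoken" value="abc123">'
    '<input type="text" name="username">'
    '</form>'
)
token = extract_hidden_field(sample_html)
```

In a real run, the HTML would come from the response to the first `session.get(...)` on the login page. With Beautiful Soup, the equivalent lookup would be `soup.find('input', {'name': 'csrfmiddlewaretoken'})`.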
import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
session = requests.Session()

# Placeholder values: fill these in, and check that the field names match the site's login form
login_data = {
    'username': '<your username>',
    'csrfmiddlewaretoken': '<token from the login form>',
    'password': '<your password>',
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will redirect you and it might load some initial cookies
r = session.post('https://<theurl>/login.py', login_data)
if r.status_code == 200:  # the request was accepted
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
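A 200 status on the login POST does not by itself prove that the login worked, since a site may answer 200 and simply show the login form again. With `requests`, the final URL after redirects is available as `res.url` and the redirect chain as `res.history`, so one possible sanity check is to look at where the last request actually ended up. The sketch below assumes the login page lives under a `/login` path, which is a guess for illustration; `landed_on_login` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def landed_on_login(final_url: str, login_path: str = "/login") -> bool:
    """Return True if the final URL (after redirects) points at the login page."""
    path = urlsplit(final_url).path.rstrip("/")
    login_path = login_path.rstrip("/")
    # Match the login path itself or anything nested beneath it.
    return path == login_path or path.startswith(login_path + "/")


# Made-up final URLs: one bounced to the login page, one on the target page.
bounced = landed_on_login("https://www.memrise.com/login/?next=/course/2021573/")
reached = landed_on_login(
    "https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/"
)
```

With the session code above, this could be called as `landed_on_login(res.url)` before parsing `res.text`, so that the scraper fails loudly instead of silently parsing the login page.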