How to scrape a page if it first redirects to another page


Question

I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but when the link is loaded through a web driver it is automatically redirected to a login page. After I log in, the browser goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the login page.

How do I make Beautiful Soup scrape the page I want rather than the login page?

I have already tried putting a time.sleep() before the scraping code to give myself time to log in, but that did not work either.

import time
import requests
from bs4 import BeautifulSoup

url = "https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')  # fetched once, without any login cookies
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)

Recommended answer

What you probably need is a persistent session with requests. A session keeps the cookies it receives, so once you log in through it, later requests made with the same session are still logged in. The general idea is simple:

  1. Open a session and send a request to the website.
  2. Send the login request so that you are logged in.
  3. Query the URL using the same session.

You will need to understand how the login POST request is structured and what data it passes (username, email, etc.), and put that data in a dictionary to send with the login request.
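One way to find out which fields the login request expects is to read them off the login form itself. The sketch below uses only Python's standard library to collect the name/value pairs of every input tag in a page; hidden inputs such as csrfmiddlewaretoken usually show up this way. The sample HTML and the names FormFieldParser and collect_form_fields are made up for illustration and are not taken from memrise.com.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects name/value pairs from every <input> tag it sees."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # inputs without a name are not submitted with the form
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict mapping input names to their default values."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical login form, for illustration only
sample_html = """
<form method="post" action="/login/">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="next" value="/course/">
</form>
"""

login_data = collect_form_fields(sample_html)
login_data["username"] = "<your username>"
login_data["password"] = "<your password>"
print(login_data)
```

In practice the HTML would come from the login page fetched with the session. If a page contains several forms, this sketch merges their fields, so it may need narrowing to the login form. The browser's network tab shows the same information for a real login.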

import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.Session()

# Placeholder values: fill in your own. The field names depend on the site's login form.
login_data = {
    'username': '<your username>',
    'csrfmiddlewaretoken': '<csrf token>',
    'password': '<your password>',
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will redirect you and it might load some initial cookies

r = session.post('https://<theurl>/login.py', data=login_data)

if r.status_code == 200:  # the request was accepted
    res = session.get(url)  # same session, so the login cookies are sent along
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
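A 200 status alone does not prove the login worked, since a failed login often returns the login page again with status 200. As a hedged sketch of an extra check: requests records any redirects in response.history and the final address in response.url, so you can test whether you ended up on a login path. The helper names ended_on_login and describe_redirects and the '/login' path fragment are assumptions; adjust them to whatever the site actually uses.

```python
from urllib.parse import urlparse


def ended_on_login(response, login_fragment="/login"):
    """Return True if the final URL of the response looks like a login page.

    `response` is expected to behave like requests.Response, where `.url`
    holds the final URL after redirects. `login_fragment` is an assumed
    marker for the login path, not something confirmed for any given site.
    """
    final_path = urlparse(response.url).path
    return login_fragment in final_path


def describe_redirects(response):
    """List the status codes of the redirects that led to this response."""
    return [hop.status_code for hop in response.history]
```

With the session from the answer, this could be called as ended_on_login(res) right after res = session.get(url), before handing res.text to BeautifulSoup. If it returns True, the login data or the login URL probably needs another look.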
