How to scrape a page if it is redirected to another page first

Views: 28 Published: 2021/4/15 19:09:46 Tags: python html web-scraping beautifulsoup


Problem description

I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but when the link is loaded through a web driver it is automatically redirected to a login page. After I log in, the browser goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the login page.

How do I make Beautiful Soup scrape the page I want instead of the login page?

I have already tried putting a time.sleep() before the scraping step to give myself time to log in, but that didn't work either.

import time
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)

Recommended answer

What you probably need is a persistent session with requests. This answer probably covers exactly what you need. The general idea is simple:

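1. Open a session and send a request to the website.
2. Send a login request so that you are logged in.
3. Query the URL using the same session.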

You will need to understand how the login POST request is structured and what data it passes (username, email, etc.), then create a JSON-like dictionary with that data.
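One way to find out which fields the login form expects is to parse the login page and list its hidden inputs. The sketch below is only an illustration: the helper name `hidden_form_fields` and the sample HTML are hypothetical, and the real form on the target site may use different field names. It uses the standard-library `HTMLParser` so that it runs without extra packages; with Beautiful Soup the equivalent lookup would be something like `soup.find('input', {'name': 'csrfmiddlewaretoken'})`.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_form_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login page, only for illustration; the real markup may differ.
sample_login_page = """
<form method="post" action="/login/">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="hidden" name="next" value="/course/">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

fields = hidden_form_fields(sample_login_page)
print(fields)
```

In a real run, the HTML would come from the response of the first `session.get(...)` on the login page, and the collected fields could be merged with the username and password before posting.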

import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.Session()

# Fill in the placeholder values; the field names must match the site's actual login form
login_data = {
    'username': '<your username>',
    'csrfmiddlewaretoken': '<csrf token from the login page>',
    'password': '<your password>',
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will redirect you and it might load some initial cookie info

r = session.post('https://<theurl>/login.py', data=login_data)

if r.status_code == 200:  # if the request was accepted
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
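A status code of 200 does not necessarily mean the login worked: requests follows redirects by default, and a failed login often just renders the login form again with a 200 response. A cheap extra check is to look at the final URL of the response (`res.url`) before parsing. The sketch below assumes the login page lives under a path ending in `/login`; both that path and the helper name `landed_on_login` are hypothetical and should be adapted to the real site.

```python
from urllib.parse import urlsplit


def landed_on_login(final_url, login_path="/login"):
    """Return True if the path of the final URL looks like the login page.

    login_path is an assumption; adjust it to the site's real login route.
    """
    path = urlsplit(final_url).path.rstrip("/")
    return path.endswith(login_path)


# Intended use with the session from the answer (not executed here):
#     res = session.get(url)
#     if landed_on_login(res.url):
#         raise RuntimeError("Still on the login page; check login_data")
#     soup = BeautifulSoup(res.text, 'html.parser')

print(landed_on_login("https://example.com/login/?next=/course/"))
print(landed_on_login("https://example.com/course/2021573/garden/speed_review/"))
```

`res.history` can also be inspected: it holds the intermediate redirect responses, so a non-empty list means the request did not end where it started.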
