Scrape a website that requires login with BeautifulSoup


Question

I want to scrape a website that requires login, using Python with the BeautifulSoup and requests libraries (no Selenium). This is my code:

import requests
from bs4 import BeautifulSoup

auth = (username, password)
headers = {
    'authority': 'signon.springer.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'origin': 'https://signon.springer.com',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://signon.springer.com/login?service=https%3A%2F%2Fpress.nature.com%2Fcallback%3Fclient_name%3DCasClienthttps%3A%2F%2Fpress.nature.com&locale=en&gtm=GTM-WDRMH37&message=This+page+is+only+accessible+for+approved+journalists.+Please+log+into+your+press+site+account.+For+more+information%3A+https%3A%2F%2Fpress.nature.com%2Fapprove-as-a-journalist&_ga=2.25951165.1431685211.1610963078-2026442578.1607341887',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'SESSION=40d2be77-b3df-4eb6-9f3b-dac31ab66ce3',
}

params = (
    ('service', 'https://press.nature.com/callback?client_name=CasClienthttps://press.nature.com'),
    ('locale', 'en'),
    ('gtm', 'GTM-WDRMH37'),
    ('message', 'This page is only accessible for approved journalists. Please log into your press site account. For more information: https://press.nature.com/approve-as-a-journalist'),
    ('_ga', '2.25951165.1431685211.1610963078-2026442578.1607341887'),
)

data = {
  'username': username,
  'password': password,
  'rememberMe': 'true',
  'lt': 'LT-95560-qF7CZnAtuDqWS1sFQgBMqPVifS5mTg-16c07928-2faa-4ce0-58c7-5a1f',
  'execution': 'e1s1',
  '_eventId': 'submit',
  'submit': 'Login'
}

session = requests.session()
response = session.post('https://signon.springer.com/login', headers=headers, params=params, data=data, auth=auth)
print(response)
# time.sleep(5) makes no difference
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # not the results I want

I'm not getting the HTML page with all the data I want; the page I get back is the login page. This is the HTML response: https://www.codepile.net/pile/EGY0YQMv
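A quick way to confirm this symptom (the response is still the login page, not the target page) is to check the parsed HTML for a password field. A minimal sketch, using a made-up snippet shaped like a CAS login form:

```python
from bs4 import BeautifulSoup

# Hypothetical response body -- a login form like the one
# signon.springer.com returns when authentication has not succeeded.
html = """
<html><body>
  <form id="fm1" action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="hidden" name="lt" value="LT-12345">
  </form>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# If the page still contains a password input, login did not succeed.
still_login_page = soup.select_one('input[type=password]') is not None
print(still_login_page)  # True
```

If this check prints True on your actual response, the POST was rejected and the session has no auth cookies.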

I think the problem is that I want to scrape this page:

https://press.nature.com/press-releases

But when I click that link (while not logged in), it redirects me to a different site to log in:

https://signon.springer.com/login

To get all the headers, params, and data values I used:

Inspect page -> Network tab -> find the login request -> Copy as cURL -> https://curl.trillworks.com/

I have tried multiple POST and GET methods, with and without the auth param, but the result is the same. What am I doing wrong?

Solution

Try running the script with your username and password filled in and let me know what you get. If it still doesn't log you in, add further headers to the POST request.

import requests
from bs4 import BeautifulSoup

link = 'https://signon.springer.com/login'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # collect the name/value pairs of every input field in the login form,
    # including the hidden one-time tokens such as lt and execution
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['username'] = username
    payload['password'] = password

    print(payload)  # you should see the required parameters within payload

    s.post(link, data=payload)
    # we are already logged in, so the login cookies are stored within the
    # session and are reused in the subsequent request
    r = s.get('https://press.nature.com/press-releases')
    print(r.status_code)
    print(r.text)
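The key line is the dict comprehension that rebuilds the form payload. Tokens like `lt` and `execution` are generated per session, so hard-coding values copied from the browser (as in the question) submits stale tokens; parsing them out of the freshly fetched form is what makes the POST valid. A small self-contained illustration with a made-up form (field names taken from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical login form, similar in shape to the CAS form
# served by signon.springer.com.
form_html = """
<form method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="lt" value="LT-99999-fresh-token">
  <input type="hidden" name="execution" value="e2s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
"""

soup = BeautifulSoup(form_html, 'html.parser')
# every <input> that has a name attribute; missing value defaults to ''
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
payload['username'] = 'someuser'
payload['password'] = 'somepass'

print(payload['lt'])         # LT-99999-fresh-token
print(payload['execution'])  # e2s1
```

Whatever one-time values the server put into the form are carried straight back in the POST, which is exactly what a real browser does when submitting the form.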
