Scrape website that requires login with BeautifulSoup
Question
I want to scrape a website that requires login, using Python with the BeautifulSoup and requests libs (no Selenium). This is my code:
import requests
from bs4 import BeautifulSoup
auth = (username, password)
headers = {
'authority': 'signon.springer.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'origin': 'https://signon.springer.com',
'content-type': 'application/x-www-form-urlencoded',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://signon.springer.com/login?service=https%3A%2F%2Fpress.nature.com%2Fcallback%3Fclient_name%3DCasClienthttps%3A%2F%2Fpress.nature.com&locale=en>m=GTM-WDRMH37&message=This+page+is+only+accessible+for+approved+journalists.+Please+log+into+your+press+site+account.+For+more+information%3A+https%3A%2F%2Fpress.nature.com%2Fapprove-as-a-journalist&_ga=2.25951165.1431685211.1610963078-2026442578.1607341887',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': 'SESSION=40d2be77-b3df-4eb6-9f3b-dac31ab66ce3',
}
params = (
('service', 'https://press.nature.com/callback?client_name=CasClienthttps://press.nature.com'),
('locale', 'en'),
('gtm', 'GTM-WDRMH37'),
('message', 'This page is only accessible for approved journalists. Please log into your press site account. For more information: https://press.nature.com/approve-as-a-journalist'),
('_ga', '2.25951165.1431685211.1610963078-2026442578.1607341887'),
)
data = {
'username': username,
'password': password,
'rememberMe': 'true',
'lt': 'LT-95560-qF7CZnAtuDqWS1sFQgBMqPVifS5mTg-16c07928-2faa-4ce0-58c7-5a1f',
'execution': 'e1s1',
'_eventId': 'submit',
'submit': 'Login'
}
session = requests.session()
response = session.post('https://signon.springer.com/login', headers=headers, params=params, data=data, auth = auth)
print(response)
# time.sleep(5) does not make any difference
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # I'm not getting the results that I want
I'm not getting the required HTML page with all the data that I want; the HTML page that I'm getting is the login page. This is the HTML response: https://www.codepile.net/pile/EGY0YQMv
I think the problem is that I want to scrape this page:
https://press.nature.com/press-releases
But when I click on that link (and I'm not logged in), it redirects me to a different website to log in:
https://signon.springer.com/login
To get all the headers, params, and data values I used:
inspect page -> network -> find login request -> copy cURL -> https://curl.trillworks.com/
I have tried multiple POST and GET methods, with and without the auth param, but the result is the same.
What am I doing wrong?
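One quick way to confirm the failure mode is to check whether the response you got back is still the login page rather than the target content. A minimal sketch of that check, using a made-up HTML snippet in place of the real response body:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content when the server bounces you back to the login form.
html = '<html><head><title>Sign in</title></head><body><form action="/login" method="post"></form></body></html>'

soup = BeautifulSoup(html, 'html.parser')
# If the page still contains a form posting to a login endpoint, authentication failed.
login_form = soup.find('form', action=lambda a: a and 'login' in a)
print(login_form is not None)
```

If this prints True for your real response, the POST never authenticated and the session has no valid login cookies.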
Recommended answer
Try running the script below, filling in your username and password fields, and let me know what you get. If it still doesn't log you in, make sure to use additional headers within the POST request.
import requests
from bs4 import BeautifulSoup

link = 'https://signon.springer.com/login'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Parse the names and values of every input field available in the login form,
    # including hidden fields such as 'lt' and 'execution'.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['username'] = username
    payload['password'] = password
    print(payload)  # when you print this, you should see the required parameters within payload
    s.post(link, data=payload)
    # As we have already logged in, the login cookies are stored within the session;
    # subsequent requests reuse the same session we have been using from the very beginning.
    r = s.get('https://press.nature.com/press-releases')
    print(r.status_code)
    print(r.text)
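The key trick in the script above is the dict comprehension that scoops up every named input in the form, including the hidden tokens ('lt', 'execution', '_eventId') the server expects back. A self-contained illustration on a made-up form (the field names mimic the ones in the question):

```python
from bs4 import BeautifulSoup

html = """
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="lt" value="LT-12345">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
"""

soup = BeautifulSoup(html, 'html.parser')
# Collect every input that has a name attribute, falling back to an
# empty string when the input has no value attribute.
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
print(payload)
```

This prints a dict with empty strings for username/password (to be filled in by you) and the server-issued values for the hidden fields, which is exactly the payload the login POST needs.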