Scraping a site with python3.6. I can't progress past the login page
The HTML form code for the site:
<form class="m-t" role="form" method="POST" action="">
<div class="form-group text-left">
<label for="username">Username:</label>
<input type="text" class="form-control" id="username" name="username" placeholder="" autocomplete="off" required />
</div>
<div class="form-group text-left">
<label for="password">Password:</label>
<input type="password" class="form-control" id="pass" name="pass" placeholder="" autocomplete="off" required />
</div>
<input type="hidden" name="token" value="/bGbw4NKFT+Yk11t1bgXYg48G68oUeXcb9N4rQ6cEzE=">
<button type="submit" name="submit" class="btn btn-primary block full-width m-b">Login</button>
</form>
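Whatever tool is used, a successful POST has to include every field the form would submit, including the hidden token. As a quick sanity check, the field names can be enumerated with BeautifulSoup (which mechanicalsoup already depends on); the form HTML is pasted as a string here so the sketch is self-contained:

```python
from bs4 import BeautifulSoup

# The login form from the question, reduced to its named fields.
FORM_HTML = """
<form class="m-t" role="form" method="POST" action="">
  <input type="text" name="username" required />
  <input type="password" name="pass" required />
  <input type="hidden" name="token" value="/bGbw4NKFT+Yk11t1bgXYg48G68oUeXcb9N4rQ6cEzE=">
  <button type="submit" name="submit">Login</button>
</form>
"""

soup = BeautifulSoup(FORM_HTML, "html.parser")
# Every input and the submit button carry a name the server may expect.
field_names = [el.get("name") for el in soup.find_all(["input", "button"])]
```

Note that the password field is named `pass`, not `password`, which is an easy mistake to make when building the payload by hand.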
Simple enough so far. I've scraped a number of sites in the past without issue.
I have tried: selenium, mechanize (although I had to drop back to an earlier version of Python for it), mechanicalsoup, and requests.
I have read multiple posts here on SO as well as https://kazuar.github.io/scraping-tutorial/ and http://docs.python-requests.org/en/latest/user/advanced/#session-objects, among many others.
Sample code:
import requests
from lxml import html

session_requests = requests.session()
result = session_requests.get(url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='token']/@value")))[0]
# Field names taken from the form above; note the field is 'pass', not 'password'.
payload = {
    "username": username,
    "pass": password,
    "token": authenticity_token,
}
result = session_requests.post(
    url,
    data=payload,
    headers=dict(referer=url),
)
result = session_requests.get(url3)
print(result.text)
and
import mechanicalsoup
import requests
from http import cookiejar
c = cookiejar.CookieJar()
s = requests.Session()
s.cookies = c
browser = mechanicalsoup.Browser(session=s)
login_page = browser.get(url)
login_form = login_page.soup.find('form', {'method':'POST'})
login_form.find('input', {'name': 'username'})['value'] = username
login_form.find('input', {'name': 'pass'})['value'] = password
response = browser.submit(login_form, login_page.url)
Try as I might, I cannot get back anything other than the HTML of the login page itself, and I don't know where to look next to figure out what isn't happening and why.
(url is a variable holding the login page URL; url3 is a page I want to scrape after logging in.)
Any help would be much appreciated!
Did you try headers?
First try the login in a browser and observe which headers are sent, then send those same headers in your requests. Headers are an important way for a server to identify the user or client.
Also try from a different IP; someone may be watching the requesting IP.
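For example, with requests a session can be given browser-like headers once and then reuse them on every request. The header values below are illustrative; copy the real ones your browser sends (developer tools, Network tab) for the login request:

```python
import requests

# Illustrative browser-like headers; replace with the actual values your
# browser sends for the login request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://example.com/login",  # hypothetical login URL
}

session = requests.Session()
session.headers.update(headers)  # every request in this session sends them
```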
Try this example. Here I am using selenium with the Chrome driver. First I get the cookies from selenium and save them in a file for later use, then I use requests with the saved cookies to access pages that require login.
from selenium import webdriver
import os
import demjson
import requests

# Download chromedriver, put it at some accessible location, and set the path.
# URL to download the Chrome driver:
# https://chromedriver.storage.googleapis.com/index.html?path=2.27/
chrompathforselenium = "/path/chromedriver"
os.environ["webdriver.chrome.driver"] = chrompathforselenium
driver = webdriver.Chrome(executable_path=chrompathforselenium)
driver.set_window_size(1120, 550)

driver.get(url1)
driver.find_element_by_name("username").send_keys(username)
driver.find_element_by_name("pass").send_keys(password)
# You need to find how to access the button, e.g. on the basis of a class
# attribute; here I am doing it on the basis of ID.
driver.find_element_by_id("btnid").click()

# Set your accessible cookiepath here.
cookiepath = ""
cookies = driver.get_cookies()
getCookies = open(cookiepath, "w+")
getCookies.write(demjson.encode(cookies))
getCookies.close()

readCookie = open(cookiepath, "r")
cookieString = readCookie.read()
readCookie.close()
cookieList = demjson.decode(cookieString)
# get_cookies() returns a list of dicts; requests expects a name -> value
# mapping, so convert before passing it along.
cookie = {c["name"]: c["value"] for c in cookieList}

headers = {}
# write all the headers
headers.update({"key": "value"})

response = requests.get(url3, headers=headers, cookies=cookie)
# check your response
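Note that demjson is a third-party package; for the simple list of dicts that `get_cookies()` returns, the standard-library json module works just as well. A self-contained sketch of the save/restore round trip, using sample cookie data in the shape selenium produces:

```python
import json
import os
import tempfile

# Sample data in the shape driver.get_cookies() returns.
cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com"},
    {"name": "token", "value": "xyz", "domain": "example.com"},
]

cookiepath = os.path.join(tempfile.gettempdir(), "cookies.json")
with open(cookiepath, "w") as f:
    json.dump(cookies, f)

with open(cookiepath) as f:
    restored = json.load(f)

# Convert to the name -> value mapping that requests expects.
cookie_dict = {c["name"]: c["value"] for c in restored}
```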