Scraping a site with python3.6. I can't progress past the login page


The HTML form code for the site:

                <form class="m-t" role="form" method="POST" action="">

                <div class="form-group text-left">
                    <label for="username">Username:</label>
                    <input type="text" class="form-control" id="username" name="username" placeholder="" autocomplete="off" required />
                </div>
                <div class="form-group text-left">
                    <label for="password">Password:</label>
                    <input type="password" class="form-control" id="pass" name="pass" placeholder="" autocomplete="off" required />
                </div>

                <input type="hidden" name="token" value="/bGbw4NKFT+Yk11t1bgXYg48G68oUeXcb9N4rQ6cEzE=">
                <button type="submit" name="submit" class="btn btn-primary block full-width m-b">Login</button>

Simple enough so far. I've scraped a number of sites in the past without issue.

I have tried: selenium, mechanize (albeit I had to drop back to an earlier version of Python), mechanicalsoup, and requests.

I have read multiple posts here on SO, as well as https://kazuar.github.io/scraping-tutorial/ and http://docs.python-requests.org/en/latest/user/advanced/#session-objects, and many more.

Sample code:

import requests
from lxml import html
session_requests = requests.session()
result = session_requests.get(url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='token']/@value")))[0]
# payload is assumed to be a dict of the form fields (username, pass, token, submit)
result = session_requests.post(
    url, 
    data = payload, 
    headers = dict(referer=url)
)
result = session_requests.get(url3)
print(result.text)

and

import mechanicalsoup
import requests
from http import cookiejar

c = cookiejar.CookieJar()
s = requests.Session()
s.cookies = c
browser = mechanicalsoup.Browser(session=s)

login_page = browser.get(url)

login_form = login_page.soup.find('form', {'method':'POST'})

login_form.find('input', {'name': 'username'})['value'] = username
login_form.find('input', {'name': 'pass'})['value'] = password

response = browser.submit(login_form, login_page.url)
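For reference, here is a sketch of how the `payload` in the requests attempt above would typically be built. The field names come from the form HTML shown earlier; that the server checks the hidden token and the submit button's name is an assumption, not something the site confirms.

```python
# Sketch: assemble the login payload from the parsed login page, mirroring
# what a browser would submit. Field names are taken from the form HTML
# above; including 'submit' is an assumption about the backend.
from lxml import html

def build_payload(login_html, username, password):
    tree = html.fromstring(login_html)
    # the hidden anti-CSRF token must be re-sent with the POST
    token = tree.xpath("//input[@name='token']/@value")[0]
    return {
        "username": username,
        "pass": password,    # the input is named 'pass', not 'password'
        "token": token,
        "submit": "submit",  # some backends also check the button's name
    }
```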

Try as I might, I just cannot get back anything other than the HTML for the login page, and I don't know where to look next to figure out what's not happening and why.

url = variable that holds the login page URL; url3 = a page I want to scrape.

Any help would be much appreciated!

Solution

Did you try headers?

First try it in the browser, observe which headers are required, and send those headers with your requests. Headers are an important part of identifying the user or client.
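A minimal sketch of that with requests; the header values below are illustrative placeholders, not the site's actual requirements (copy the real ones from the Network tab of your browser's developer tools):

```python
import requests

# Illustrative browser-like headers -- placeholders only; replace them with
# the values your browser actually sends for this site.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "https://example.com/login",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(headers)  # now sent with every request in this session
```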

Try from a different IP; maybe someone is watching the requesting IP.
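If IP blocking is the suspicion, requests can route traffic through a proxy (a sketch; the proxy address is a made-up placeholder):

```python
import requests

# Hypothetical proxy endpoint -- substitute one you actually control.
proxies = {
    "http": "http://203.0.113.7:8080",
    "https": "http://203.0.113.7:8080",
}

session = requests.Session()
session.proxies.update(proxies)
# session.get(url3) would now be routed through the proxy
```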

Try this example. Here I am using selenium with the Chrome driver. First I get the cookies from selenium and save them in a file for later use, then I use requests with the saved cookies to access pages that require login.

from selenium import webdriver
import os
import demjson
import requests

# download chromedriver from the location below, put it somewhere accessible, and set the path
# url to download the Chrome driver - https://chromedriver.storage.googleapis.com/index.html?path=2.27/
chrompathforselenium = "/path/chromedriver"

os.environ["webdriver.chrome.driver"] = chrompathforselenium
driver = webdriver.Chrome(executable_path=chrompathforselenium)
driver.set_window_size(1120, 550)

driver.get(url1)

driver.find_element_by_name("username").send_keys(username)
driver.find_element_by_name("pass").send_keys(password)

# you need to find out how to access the button on the basis of its class attribute;
# here I am doing it on the basis of the ID
driver.find_element_by_id("btnid").click()

# set your accessible cookiepath here
cookiepath = ""

cookies = driver.get_cookies()
getCookies = open(cookiepath, "w+")
getCookies.write(demjson.encode(cookies))
getCookies.close()

readCookie = open(cookiepath, "r")
cookieString = readCookie.read()
cookie = demjson.decode(cookieString)

# selenium's get_cookies() returns a list of dicts; requests expects {name: value}
cookie = {c["name"]: c["value"] for c in cookie}

headers = {}
# write all the headers
headers.update({"key": "value"})

response = requests.get(url3, headers=headers, cookies=cookie)
# check your response
