Login to website using Scrapy
Problem description
I am writing a spider that has to log in to a website before scraping it. The rest of the spider is written, but I can't get the login step to work. Please have a look at my code.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class ScotlandSpider(scrapy.Spider):
    name = 'scotland'
    allowed_domains = ['www.whoownsscotland.org.uk']
    login_url = r'http://www.whoownsscotland.org.uk/login.php?p=%2Fsearch.php'
    start_urls = ['http://www.whoownsscotland.org.uk/search.php']

    def login(self , response):
        data = {
            'name' : 'USERNAME',
            'pass' : 'PASSWORD',
            'previous' : r'%2Fsearch.php',
            'login' : 'login'
        }
        yield FormRequest(url=self.login_url, formdata=data ,callback=self.parse)

    def parse(self, response):
        open_in_browser(response)
        links = response.xpath('//p/a/@href').extract()
        for link in links:
            absoulute_url = response.urljoin(link)
            yield scrapy.Request(absoulute_url , callback=self.parse_links)

    def parse_links(self , response):
        cities = response.xpath('//*[@id="layout-right"]/table/tr/td/p/a/@href').extract()
        for city in cities:
            absoulute_url_new = response.urljoin(city)
            yield scrapy.Request(absoulute_url_new , callback=self.parse_cities)

    def parse_cities(self , response):
        record = response.xpath('//*[@id="layout-left"]/table/tr')
        estate = record[0].xpath('.//td/text()').extract()
        courty = record[1].xpath('.//td/text()').extract()
        grid_ref = record[2].xpath('.//td/text()').extract()
        acreage = record[3].xpath('.//td/text()').extract()
        os_15 = record[4].xpath('.//td/text()').extract()
        owner = record[5].xpath('.//td/text()').extract()
        owner_address = record[6].xpath('.//td/text()').extract()
        property_address = record[7].xpath('.//td/text()').extract()
        website = record[8].xpath('.//td/text()').extract()
        further_info = record[9].xpath('.//td//text()').extract()
        contacts = record[10].xpath('.//td//text()').extract()
        regsiters_sheet = record[11].xpath('.//td//text()').extract()
        regsiters_certificate = record[12].xpath('.//td//text()').extract()
        currency_of_data = record[13].xpath('.//td//text()').extract()
        yield {
            "Estate" : estate,
            "County" : courty,
            "Grid Reference" : grid_ref,
            "Acreage" : acreage,
            "OS 1:50k Sheet" : os_15,
            "Owner" : owner,
            "Owner Address" : owner_address,
            "Property Address" : property_address,
            "Website" : website,
            "Further Information" : further_info,
            "Contacts" : contacts,
            "Registers of Scotland Sasines Search Sheet No" : regsiters_sheet,
            "Registers of Scotland Land Certificate No" : regsiters_certificate ,
            "Currency of Data" : currency_of_data
        }
Recommended answer
The problem is simple: you've created the login() method, but you never call it.
The simplest way to solve this is to rename that method to start_requests(). Scrapy will then call it to generate the initial requests instead of generating them from start_urls. Because Scrapy calls start_requests() with no arguments, the unused response parameter has to be dropped as well.