Following hyperlink and "Filtered offsite request"
Problem description
I know that there are several related threads out there, and they have helped me a lot, but I still can't get all the way. I am at the point where running the code doesn't result in errors, but I get nothing in my csv file. I have the following Scrapy spider that starts on one webpage, then follows a hyperlink, and scrapes the linked page:
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class bbrItem(Item):
    Year = Field()
    AppraisalDate = Field()
    PropertyValue = Field()
    LandValue = Field()
    Usage = Field()
    LandSize = Field()
    Address = Field()

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationControl')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)
        for bbr in bbrs:
            item = bbrItem()
            item['Year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['AppraisalDate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['PropertyValue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['LandValue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['Usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['LandSize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['Address'] = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        PartUrl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", PartUrl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)
I am trying to export the results to a csv file, but I get nothing in the file. Running the code, however, doesn't result in any errors. I know it's a simplified example with only one URL, but it illustrates my problem.
I think my problem could be that I am not telling Scrapy that I want to save the data in the parse2 method.
Btw, I run the spider as scrapy crawl spiderBBR -o scraped_data.csv -t csv
Answer
You need to modify the Request yielded in parse to use parse2 as its callback.
Also, allowed_domains shouldn't include the http prefix, e.g.:
allowed_domains = ["boliga.dk"]
Try that and see if your spider still runs correctly, rather than leaving allowed_domains blank.
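To see why the http prefix matters, here is a minimal, self-contained sketch of the kind of hostname check Scrapy's OffsiteMiddleware performs (simplified for illustration; the real middleware uses a compiled regex, but the comparison logic is analogous): the request's hostname is matched against the entries in allowed_domains, and "http://boliga.dk" can never match the hostname "www.boliga.dk", so the request is dropped as offsite.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified stand-in for Scrapy's offsite check: the request's
    # hostname must equal an allowed domain or be a subdomain of one.
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

url = "http://www.boliga.dk/bbr/resultater"

# With the http prefix, the hostname never matches -> request is filtered.
print(is_offsite(url, ["http://boliga.dk"]))  # True (filtered as offsite)

# Without the prefix, www.boliga.dk is a subdomain of boliga.dk -> allowed.
print(is_offsite(url, ["boliga.dk"]))         # False (request goes through)
```

This is why the spider produces an empty csv without raising any error: the follow-up Request is silently dropped by the offsite filter before parse2 ever runs.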