Following hyperlink and "Filtered offsite request"


Problem description

I know that there are several related threads out there, and they have helped me a lot, but I still can't get all the way. I am at the point where running the code doesn't result in errors, but I get nothing in my csv file. I have the following Scrapy spider that starts on one webpage, then follows a hyperlink, and scrapes the linked page:

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class bbrItem(Item):
    Year = Field()
    AppraisalDate = Field()
    PropertyValue = Field()
    LandValue = Field()
    Usage = Field()
    LandSize = Field()
    Address = Field()    

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):        
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationControl')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)
        for bbr in bbrs:
            item = bbrItem()
            item['Year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['AppraisalDate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['PropertyValue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['LandValue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['Usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['LandSize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['Address']  = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        PartUrl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", PartUrl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)

I am trying to export the results to a csv file, but I get nothing in the file. Running the code, however, doesn't result in any errors. I know it's a simplified example with only one URL, but it illustrates my problem.

I think my problem could be that I am not telling Scrapy that I want to save the data in the parse2 method.

I run the spider as: scrapy crawl spiderBBR -o scraped_data.csv -t csv

Recommended answer

You need to modify your yielded Request in parse to use parse2 as its callback.
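
In other words, the request yielded from parse should look along these lines (a minimal sketch; url2 and the address extraction are taken from the parse method in the question, and the posted code already includes this callback):

yield Request(url=url2,
              meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()},
              callback=self.parse2)  # parse2, not the default parse, handles the linked page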

EDIT: allowed_domains shouldn't include the http prefix, e.g.:

allowed_domains = ["boliga.dk"]

Try that and see if your spider still runs correctly, rather than leaving allowed_domains blank.
