Scrapy CrawlSpider Crawls Nothing


Problem Description

I am trying to crawl Booking.com. The spider opens and closes without opening and crawling the URL (output: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. Here is the code I have written so far. Please point out what I am doing wrong.

import scrapy
import urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.loader import ItemLoader
from CinemaScraper.items import CinemascraperItem


class trip(CrawlSpider):
    name = "tripadvisor"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        return scrapy.Request(url, callback=self.parse_reviews)

    def parse_reviews(self, response):
        for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item = CinemascraperItem()
            # sometimes the title is empty for some reason; not sure when it happens, but this works
            title = rev.xpath('.//*[@class="review_item_header_content"]/span[@itemprop="name"]/text()')
            if title:
                item['title'] = title[0].extract()
                positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()')
                if positive_content:
                    item['positive_content'] = positive_content[0].extract()
                negative_content = rev.xpath('.//p[@class="review_neg"]/span/text()')
                if negative_content:
                    item['negative_content'] = negative_content[0].extract()
                item['score'] = rev.xpath('./*[@class="review_item_header_score_container"]/span')[0].extract()
                # tags are separated by ;
                item['tags'] = ";".join(rev.xpath('.//ul[@class="review_item_info_tags"]/text()').extract())
                yield item

        next_page = response.xpath('//a[@id="review_next_page_link"]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_reviews)

Solution

I'd like to point out that in your question you talk about booking.com, but in your spider the start URLs point to quotes.toscrape.com, the site used in Scrapy's official tutorial. I will keep using the quotes site for the sake of explanation.

Okay, here we go... In your code snippet you are using a CrawlSpider, and it is worth mentioning that the parse method is already part of the logic behind CrawlSpider, so you should not override it. Rename your callback to something else, such as parse_item (the default callback name when you generate a crawl spider from the template, though truthfully you can name it whatever you want). By doing so it should actually crawl the site, but it all depends on the rest of your code being correct.
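To make that concrete, here is a rough sketch of one way the spider from the question could be restructured so that the rules handle the navigation and nothing overrides parse. The start URL is only a placeholder, and the rules and XPaths are assumptions built from the selectors in the question, so treat it as an illustration rather than a drop-in fix:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CinemaScraper.items import CinemascraperItem


class BookingReviewsSpider(CrawlSpider):
    name = 'booking_reviews'
    allowed_domains = ['booking.com']
    # placeholder: a real hotel page on booking.com would go here
    start_urls = ['https://www.booking.com/hotel/...']

    rules = (
        # follow the "show all reviews" button (XPath taken from the question)
        # and parse the page it leads to
        Rule(LinkExtractor(restrict_xpaths='//a[@class="show_all_reviews_btn"]'),
             callback='parse_reviews', follow=True),
        # keep following the review pagination, parsing every page on the way
        Rule(LinkExtractor(restrict_xpaths='//a[@id="review_next_page_link"]'),
             callback='parse_reviews', follow=True),
    )

    def parse_reviews(self, response):
        # note the callback is NOT called "parse"
        for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item = CinemascraperItem()
            item['title'] = rev.xpath(
                './/*[@class="review_item_header_content"]'
                '/span[@itemprop="name"]/text()').extract_first()
            yield item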

In a nutshell, the difference between a generic spider and a crawl spider is that with a crawl spider you use modules such as LinkExtractor together with Rules, setting parameters so that links found from the start URL onwards that match a pattern are used to navigate through the site, with various helpful arguments to do just that. The rule with the callback is the one whose pages you actually parse items from. In other words, the crawl spider creates the logic for the requests to navigate as desired.

Notice that in the rule set I enter "page/.*". The ".*" is a regular expression that says: "from the page I'm in, any link on this page whose URL matches the .../page... pattern will be followed AND called back to parse_item".

This is a SUPER simple example... you can enter a pattern to JUST follow, or to JUST call back to your item-parsing function...
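If you want to sanity-check what a given allow pattern will actually match before wiring it into a Rule, one quick way (a sketch, assuming you run it from scrapy shell against the quotes site) is to call the LinkExtractor by hand:

# inside: scrapy shell http://quotes.toscrape.com/
from scrapy.linkextractors import LinkExtractor

# same pattern as in the rule below; extract_links() returns the Link objects
# that the rule would follow from this response
for link in LinkExtractor(allow=r'page/.*').extract_links(response):
    print(link.url)   # e.g. http://quotes.toscrape.com/page/2/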

With a normal spider you have to work out the site navigation manually to get to the content you want...


CrawlSpider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = QuotesItem()
        item['quote'] = response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        yield item


Generic Spider

import scrapy
from quotes.items import QuotesItem

class QspiSpider(scrapy.Spider):
    name = "qSpi"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = QuotesItem()
            item['quote'] = quote.css('span.text::text').extract()
            item['author'] = quote.css('small.author::text').extract()
            item['tags'] = quote.css("div.tags > a.tag::text").extract()
            yield item

        for nextPage in response.css('li.next a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(nextPage))


EDIT: Additional info at request of OP

"...I cannot understand how to add arguments to the Rule parameters"

Okay... let's look at the official documentation just to reiterate the crawl spider's definition...
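The short version is that a Rule wraps a LinkExtractor, and both accept keyword arguments. As a hedged illustration (the deny and restrict_xpaths values below are made-up examples, not taken from the question), a single rule with a few of those arguments spelled out looks like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# every keyword here is a real Rule / LinkExtractor parameter;
# the values themselves are only examples
rule = Rule(
    LinkExtractor(
        allow=r'search/hsa.*',                        # regex the link URL must match
        deny=r'format=rss',                           # regex that excludes a link (example value)
        restrict_xpaths='//div[@class="content"]',    # only pull links from this region (example value)
    ),
    callback='parse_item',   # spider method to call with each matched response
    follow=True,             # keep applying the rules to the followed responses
)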

So crawl spiders create the logic behind following links by using the rule set... now let's say I want to crawl Craigslist with a crawl spider for household items for sale only. I want you to take notice of the two things marked in red in the screenshots...

Number one is to show the URL when I'm on the Craigslist household-items listing page.

So we gather that anything under "search/hsa..." will be a page of the household-items listing, starting from the first page reached from the landing page.

The big red number "2" is to show that when we are on an actual posted item, all item URLs seem to contain ".../hsh/...", so any link on the previous page that has this pattern is one I want to follow and scrape from. So my spider would be something like...

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from craigListCrawl.items import CraiglistcrawlItem

class CcrawlexSpider(CrawlSpider):
    name = 'cCrawlEx'
    allowed_domains = ['columbia.craigslist.org']
    start_urls = ['https://columbia.craigslist.org/']

    rules = (
        Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True),
        Rule(LinkExtractor(allow=r'hsh.*'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = CraiglistcrawlItem()
        item['title'] = response.css('title::text').extract()
        item['description'] = response.xpath("//meta[@property='og:description']/@content").extract()
        item['followLink'] = response.xpath("//meta[@property='og:url']/@content").extract()
        yield item

I want you to think of it like the steps you take to get from the landing page to the page that actually holds your content. So we land on the page which is our start_url... then we said that the household-items listing has a pattern, as you can see in the first rule...

Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True)

Here it says to follow links matching the regular expression pattern "search/hsa.*"... remember that ".*" is the regex that matches anything after "search/hsa", in this case at least.

The logic then continues and says that any link matching the pattern "hsh.*" is to be called back to my parse_item.
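Tying this back to the question about adding arguments to the rules: a Rule can also take, for instance, a process_links hook and an explicit follow flag. A sketch of the Craigslist spider using them might look like this (the drop_fragments filter is made up purely for illustration):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CcrawlexArgsSpider(CrawlSpider):
    name = 'cCrawlExArgs'
    allowed_domains = ['columbia.craigslist.org']
    start_urls = ['https://columbia.craigslist.org/']

    rules = (
        # listing pages: follow them and filter the extracted links first,
        # but never call a callback for them
        Rule(LinkExtractor(allow=r'search/hsa.*'),
             follow=True,
             process_links='drop_fragments'),
        # item pages: parse them, and do not follow further links from there
        Rule(LinkExtractor(allow=r'hsh.*'),
             callback='parse_item',
             follow=False),
    )

    def drop_fragments(self, links):
        # process_links receives the extracted Link objects before the requests
        # are scheduled, so you can filter or rewrite them here
        return [link for link in links if '#' not in link.url]

    def parse_item(self, response):
        yield {'title': response.css('title::text').extract_first()}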

If you think of it as the steps, in terms of "clicks", it takes to get from one page to another, it should help... Crawl spiders are perfectly acceptable, but generic spiders give you the most control over the resources your Scrapy project ends up using, meaning that a well-written generic spider can be more precise and far faster.
