scrapy isn't working right in extracting the title


Problem Description

In this code I want to scrape the title, subtitle and data inside the links, but I'm having issues on pages beyond 1 and 2: only 1 item gets scraped. I also want to extract only those entries whose title contains delhivery.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from delhivery.items import DelhiveryItem




class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2"]


    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhiveryItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span[@style="background-color:yellow"]/text()').extract()[0]
            #item['title'] = site.xpath('.//td[@class="complaint"]/a[text() = "%s Delivery Courier %s"]/text()').extract()[0]
            item['subtitle'] = site.xpath('.//td[@class="compl-text"]/div/b[1]/text()').extract()[0]


            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[0].strip()
            item['username'] = site.xpath('.//td[@class="small"]/a[2]/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//td[@style="padding-bottom:15px"]/div/text()').extract()[0]


        yield old_item
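A side note on the `urljoin` call in `parse` above: the complaint links on the listing page can be relative, and `urljoin` resolves them against the page URL before the `Request` is made. In Python 3 the import moves from `urlparse` to `urllib.parse`; a quick sketch (the relative link below is illustrative):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

page_url = "http://www.consumercomplaints.in/?search=delhivery&page=2"
relative_link = "/complaints/delhivery-last-mile-courier-service-c772900.html"

# A root-relative link replaces the listing page's path and query,
# so the Request gets an absolute URL it can actually fetch.
absolute = urljoin(page_url, relative_link)
print(absolute)
```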

Answer

You need to change the item['title'] to this:

# The XPath must be relative (".//") so each site block yields its own title,
# and join must run over all extracted text fragments around the highlighted span.
item['title'] = ''.join(site.xpath('.//span[text() = "Delhivery"]/parent::*//text()').extract())

Also edit sites to this, to extract only the required links (the ones with Delhivery in them):

sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')
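To see what this selector does: it starts from the `<span>` holding the highlighted search term and walks up with `ancestor::div` to the enclosing complaint block, so blocks without a Delhivery span are skipped entirely. The standard library's ElementTree has no `ancestor::` axis, but the effect can be emulated on a toy page (the markup below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the search results page (structure invented for illustration).
page = ET.fromstring(
    '<body>'
    '<div><table width="100%"><tr><td class="complaint">'
    '<a href="/c1.html"><span>Delhivery</span> lost my parcel</a></td></tr></table></div>'
    '<div><table width="100%"><tr><td class="complaint">'
    '<a href="/c2.html">Some other courier</a></td></tr></table></div>'
    '</body>'
)

# Emulate //table//span[text()="Delhivery"]/ancestor::div :
# keep only the <div> blocks that contain a matching <span>.
sites = [div for div in page.findall('div')
         if any(span.text == 'Delhivery' for span in div.iter('span'))]
print(len(sites))  # only the first complaint block matches
```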

So I understand now that you need to add a pagination rule to your code. It should be something like the following; you just need to add your imports and write the new XPaths from the item's link itself, such as this one:

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Extract pagination pages, allowing only links with page=<number> to be followed
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]',),
                               allow=(r'page=\d+',), unique=True),
             follow=True),

        # Extract links of items on each page the spider gets from the first rule
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="complaint"]',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        item = DelhiveryItem()
        # Populate the item object here the same way you did; this function will be
        # called for each item link. This means you'll be extracting data from pages
        # like this one:
        # http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html#c1880509
        item['title'] = response.xpath('<write xpath>').extract()[0]
        item['subtitle'] = response.xpath('<write xpath>').extract()[0]
        item['date'] = response.xpath('<write xpath>').extract()[0].strip()
        item['username'] = response.xpath('<write xpath>').extract()[0]
        item['link'] = response.url
        item['data'] = response.xpath('<write xpath>').extract()[0]
        yield item
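The first rule's `allow=('page=\d+',)` is just a regular-expression filter on candidate URLs: only the pagination links survive it and get followed. A standalone sketch of that filtering (the URL list is illustrative):

```python
import re

page_pattern = re.compile(r'page=\d+')

candidates = [
    "http://www.consumercomplaints.in/?search=delhivery&page=2",
    "http://www.consumercomplaints.in/?search=delhivery&page=3",
    "http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html",
]

# Only the pagination links match the pattern, so only they would be
# followed by the first rule; the complaint page falls through to rule two.
pagination_links = [url for url in candidates if page_pattern.search(url)]
print(len(pagination_links))  # the two page=N links
```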

Also, when you write an XPath, I suggest that you not rely on styling attributes: prefer @class or @id, and only fall back to @width, @style or other styling parameters if there is no other way.
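The point is robustness: `@style` values change whenever the site is restyled, and several unrelated cells can share the same styling, while `@class` and `@id` tend to be stable, meaningful hooks. A small standard-library illustration (markup invented):

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    '<table>'
    '<tr><td class="small" style="padding-bottom:15px">by user123 on 1 Jan</td></tr>'
    '<tr><td class="complaint" style="padding-bottom:15px">Parcel never arrived</td></tr>'
    '</table>'
)

# Selecting on @class pins down exactly the cell we mean ...
by_class = table.findall(".//td[@class='small']")
# ... while @style matches every cell that happens to share the styling.
by_style = table.findall(".//td[@style='padding-bottom:15px']")
print(len(by_class), len(by_style))  # 1 vs 2
```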
