Scrapy only scraping first result of each page


Question

I'm currently trying to run the following code but it keeps scraping only the first result of each page. Any idea what the issue may be?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123Item
import urlparse
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = "xyz123"
    allowed_domains = ["www.xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/",]

    rules = (Rule (SgmlLinkExtractor(allow=("",),restrict_xpaths=('//*[@id="1234headerPagination_hlNextLink"]',))
    , callback="parse_xyz", follow=True),
    )

    def parse_xyz(self, response):
        hxs = HtmlXPathSelector(response)
        xyz = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

The Basespider version works well scraping ALL the required data on the first page:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123Item

class MySpider(BaseSpider):
    name = "xyz123test"
    allowed_domains = ["xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for titles in titles:
            item = xyz123Item()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

Sorry for the censoring. I had to censor the website for privacy reasons.

The first code crawls through the pages the way I'd like it to, however it only pulls the first item's title and link. NOTE: the XPath of the first title, using "inspect element" in Google Chrome, is:
//*[@id="xyz123SearchResults"]/div[1]/h2/a,
second is //*[@id="xyz123SearchResults"]/div[2]/h2/a
third is //*[@id="xyz123SearchResults"]/div[3]/h2/a etc.

I'm not sure if the div[n] bit is what's killing it. I'm hoping it's an easy fix.

Thanks

Answer

        for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

Are you sure about the indentation of `return items`? It should be one level less: inside the loop body, it executes at the end of the first iteration, so only the first item is ever returned. Dedent it so it runs after the loop has finished.
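The effect of that indentation can be demonstrated without Scrapy at all. A minimal sketch in plain Python, with a hypothetical list of titles standing in for the selector results:

```python
def parse_broken(results):
    """Mimics the original spider: return sits inside the loop body."""
    items = []
    for title in results:
        items.append({"title": title})
        return items  # exits at the end of the first iteration

def parse_fixed(results):
    """Same logic with return dedented one level."""
    items = []
    for title in results:
        items.append({"title": title})
    return items  # runs once, after the loop has consumed every result

titles = ["First result", "Second result", "Third result"]
print(len(parse_broken(titles)))  # 1 -- only the first item, as in the question
print(len(parse_fixed(titles)))   # 3 -- all items
```

The same one-level dedent of `return items` in `parse_xyz` makes the CrawlSpider version collect every `<h2>` on the page instead of just the first.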
