Scrapy only scraping first result of each page


Question

I'm currently trying to run the following code but it keeps scraping only the first result of each page. Any idea what the issue may be?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123Item
import urlparse
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = "xyz123"
    allowed_domains = ["www.xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/",]

    rules = (Rule (SgmlLinkExtractor(allow=("",),restrict_xpaths=('//*[@id="1234headerPagination_hlNextLink"]',))
    , callback="parse_xyz", follow=True),
    )

    def parse_xyz(self, response):
        hxs = HtmlXPathSelector(response)
        xyz = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

The Basespider version works well scraping ALL the required data on the first page:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123Item

class MySpider(BaseSpider):
    name = "xyz123test"
    allowed_domains = ["xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for titles in titles:
            item = xyz123Item()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

Sorry for the censoring. I had to censor the website for privacy reasons.

The first code crawls through the pages the way I'd like it to, however it only pulls the first item's title and link. NOTE: the XPath of the first title, using "inspect element" in Google Chrome, is:
//*[@id="xyz123SearchResults"]/div[1]/h2/a,
second is //*[@id="xyz123SearchResults"]/div[2]/h2/a
third is //*[@id="xyz123SearchResults"]/div[3]/h2/a etc.

I'm not sure if the div[n] bit is what's killing it. I'm hoping it's an easy fix.

Thanks

Answer

        for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

Are you sure about the indentation of `return items`? It should be one level less: inside the loop body, it executes at the end of the first iteration, so only the first item is ever returned. Dedent it so it runs after the loop has finished.
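The effect of that indentation can be demonstrated without Scrapy at all. A minimal sketch in plain Python, with a hypothetical list of titles standing in for the selector results:

```python
def parse_broken(results):
    """Mimics the original spider: return sits inside the loop body."""
    items = []
    for title in results:
        items.append({"title": title})
        return items  # exits at the end of the first iteration

def parse_fixed(results):
    """Same logic with return dedented one level."""
    items = []
    for title in results:
        items.append({"title": title})
    return items  # runs once, after the loop has consumed every result

titles = ["First result", "Second result", "Third result"]
print(len(parse_broken(titles)))  # 1 -- only the first item, as in the question
print(len(parse_fixed(titles)))   # 3 -- all items
```

The same one-level dedent of `return items` in `parse_xyz` makes the CrawlSpider version collect every `<h2>` on the page instead of just the first.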
