Scrapy parse pagination without next link
Problem Description
I'm trying to parse a pagination without a next link. The HTML is below:
<div id="pagination" class="pagination">
<ul>
<li>
<a href="//www.demopage.com/category_product_seo_name" class="page-1 ">1</a>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=2" class="page-2 ">2</a>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=3" class="page-3 ">3</a>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=4" class="page-4 active">4</a>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=5" class="page-5">5</a>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=6" class="page-6 ">6</a>
</li>
<li>
<span class="page-... three-dots">...</span>
</li>
<li>
<a href="//www.demopage.com/category_product_seo_name?page=50" class="page-50 ">50</a>
</li>
</ul>
</div>
For this HTML I have tried these XPath expressions:
response.xpath('//div[@class="pagination"]/ul/li/a/@href').extract()
or
response.xpath('//div[@class="pagination"]/ul/li/a/@href/following-sibling::a[1]/@href').extract()
Is there a good way to parse this pagination? Thanks, all.
PS: I have checked these answers too:
One solution is to scrape x number of pages, but this isn't always a good solution if the total number of pages isn't constant:
import scrapy

class MySpider(scrapy.Spider):
    num_pages = 10

    def start_requests(self):
        requests = []
        for i in range(1, self.num_pages + 1):  # pages 1..num_pages
            requests.append(scrapy.Request(
                url='https://www.demopage.com/category_product_seo_name?page={0}'.format(i)
            ))
        return requests

    def parse(self, response):
        # parse pages here
        pass
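Since the last page number is visible in the pagination block itself (the page-50 link in the question's HTML), the spider could derive num_pages at runtime instead of hardcoding it. Below is a minimal stdlib-only sketch of the idea, using a regex on an abridged copy of the sample markup; in a real spider you would extract the same numbers from response.css or response.xpath rather than a regex:

```python
import re

# Abridged copy of the pagination markup from the question.
html = '''
<div id="pagination" class="pagination">
  <ul>
    <li><a href="//www.demopage.com/category_product_seo_name" class="page-1 ">1</a></li>
    <li><a href="//www.demopage.com/category_product_seo_name?page=2" class="page-2 ">2</a></li>
    <li><a href="//www.demopage.com/category_product_seo_name?page=6" class="page-6 ">6</a></li>
    <li><span class="page-... three-dots">...</span></li>
    <li><a href="//www.demopage.com/category_product_seo_name?page=50" class="page-50 ">50</a></li>
  </ul>
</div>
'''

# Collect every ?page=N value; the highest one is the total page count.
page_numbers = [int(n) for n in re.findall(r'\?page=(\d+)', html)]
num_pages = max(page_numbers)
print(num_pages)  # 50
```

With num_pages computed this way, start_requests no longer breaks when the site grows or shrinks its page count.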
Update
You can also keep track of the page count and do something like this: a[href*="?page=2"]::attr(href) will target a elements whose href attribute contains the specified substring. (Note that *= is the CSS "contains substring" operator; ~= only matches whole space-separated words, so it would not match here.) I'm not currently able to test this code, but something in this style should do it:
import scrapy

class MySpider(scrapy.Spider):
    start_urls = ['https://demopage.com/search?p=1']
    page_count = 1

    def parse(self, response):
        self.page_count += 1
        # parse the response here
        next_url = response.css(
            '#pagination > ul > li > a[href*="?page={0}"]::attr(href)'.format(self.page_count)
        ).get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url))
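One detail worth noting with this HTML: the hrefs are protocol-relative (they begin with //), so they are not valid request URLs on their own. In Scrapy, response.urljoin resolves them against the page's URL; the standard library's urljoin behaves the same way. A small sketch (the base URL here is an illustrative assumption):

```python
from urllib.parse import urljoin

# Base URL of the page the pagination was scraped from (illustrative).
base = 'https://www.demopage.com/category_product_seo_name'

# Protocol-relative href exactly as it appears in the pagination HTML.
href = '//www.demopage.com/category_product_seo_name?page=2'

# urljoin borrows the scheme from the base URL, yielding an absolute URL.
next_url = urljoin(base, href)
print(next_url)  # https://www.demopage.com/category_product_seo_name?page=2
```

Feeding the raw protocol-relative href to scrapy.Request directly would raise an error about a missing scheme, which is why the join step matters.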