Scrapy with multiple pages

Problem Description

I have created a simple Scrapy project in which I get the total page number from the initial site example.com/full. Now I need to scrape all the pages starting from example.com/page-2 up to page 100 (if the total page count is 100). How can I do that?

Any advice would be helpful.

Code:

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/full/']
    total_pages = 0

    def parse(self, response):
        total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
        #urls = ('https://example.com/page-{}'.format(i) for i in range(1,total_pages))
        print(total_pages)

Update #1:

I tried using urls = ('https://example.com/page-{}'.format(i) for i in range(1,total_pages)), but it's not working; maybe I'm doing something wrong.
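(For what it's worth, two things would likely make that generator fail: extract_first() returns the page count as a string, so range(1, total_pages) raises a TypeError, and building the URLs alone never downloads anything; each one has to be yielded as a scrapy.Request. A minimal sketch under those assumptions, with a hypothetical parse_page callback, could look like this:)

import scrapy


class PagesSpider(scrapy.Spider):
    # Hypothetical spider reusing the example.com placeholders from the question.
    name = 'pages'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/full/']

    def parse(self, response):
        # extract_first() returns a string (or None), so cast it to int
        # before using it as a range() bound.
        total_pages = int(response.xpath(
            "//body/section/div/section/div/div/ul/li[6]/a/text()"
        ).extract_first())

        # Yield one Request per page; each response is handled by its own
        # callback instead of being parsed inside this loop.
        for page in range(2, total_pages + 1):
            url = 'https://example.com/page-{}'.format(page)
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Placeholder extraction; each page would be parsed here with
        # whatever XPath the real site needs.
        yield {'url': response.url}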

Update #2: I have changed my code to this:

class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']
    total_pages = 0

    def parse(self, response):
        total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
        for page in range(2, int(total_pages)):
            url = 'https://sanet.st/page-' + str(page)
            yield scrapy.Request(url)
            title = response.xpath('//*[@class="list_item_title"]/h2/a/span/text()').extract()
            print(title)

But the loop still shows only the first page's titles repeatedly. I need to extract the titles from the different pages and print them at the prompt. How can I do that?

Solution

You have to look for the 'next_page' link and keep requesting it for as long as it is present on the page.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class SanetSpider(scrapy.Spider):
    name = 'sanet'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        yield {
            # Extract something from the current page; here, the results counter text.
            'result': response.xpath('//h3[@class="posts-results"]/text()').extract_first()
        }

        # The raw link is relative, e.g. /page-{}/ where {} is the page number.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()

        # urljoin() makes it absolute, e.g. https://sanet.st/page-{}/.
        next_page = response.urljoin(next_page)

        # If next_page has a value
        if next_page:
            # Call parse again with the next page's URL.
            yield scrapy.Request(url=next_page, callback=self.parse)

If you run this code with the "-o sanet.json" flag, you will get the following result.

scrapy runspider sanet.py -o sanet.json

[
{"result": "results 1 - 15 from 651"},
{"result": "results 16 - 30 from 651"},
{"result": "results 31 - 45 from 651"},
...
etc.
...
{"result": "results 631 - 645 from 651"},
{"result": "results 646 - 651 from 651"}
]
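The code in Update #2 keeps printing the first page's titles because title is extracted from response, which inside that loop is always the response of the start URL; each yielded Request is parsed in a separate callback call. If the goal is the titles rather than the results counter, the same next-page loop can yield them instead. A minimal sketch along those lines, assuming the //*[@class="list_item_title"] XPath from the question matches on every listing page:

# -*- coding: utf-8 -*-
import scrapy


class SanetTitlesSpider(scrapy.Spider):
    # Hypothetical variant of the spider above that yields the titles
    # asked about in the question instead of the results counter.
    name = 'sanet_titles'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        # Yield every title found on the current listing page.
        for title in response.xpath(
                '//*[@class="list_item_title"]/h2/a/span/text()').extract():
            yield {'title': title}

        # Follow the "Next page" link for as long as one exists.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Run it the same way, for example scrapy runspider sanet_titles.py -o titles.json, and every page's titles end up in the output file.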
