Python Web Scraping in Pagination in Single Page Application


Problem Description

I am currently researching how to scrape web content with Python when the pagination is driven by JavaScript in a single-page application (SPA).

For example: https://angular-8-pagination-example.stackblitz.io/

I googled and found that Scrapy on its own cannot scrape JavaScript/SPA-driven content; it needs to be used with Splash. I am new to both Scrapy and Splash. Is this correct?

Also, how do I call the JavaScript pagination method? When I inspect the element, it is just an anchor with no href and no JavaScript event.

Please advise.

Thanks,

哈杰

Recommended Answer

You need to use a SplashRequest to render the JS. You then need to get the pagination text. Generally I use re.search with an appropriate regex pattern to extract the relevant numbers, and then assign them to a current-page variable and a total-pages variable.
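Before SplashRequest will work, scrapy-splash has to be wired into the project settings. A minimal settings.py sketch, following the scrapy-splash README and assuming a Splash instance is running locally on its default port 8050 (e.g. via the scrapinghub/splash Docker image):

# settings.py -- scrapy-splash wiring; the middleware priorities
# below are the ones the scrapy-splash README recommends.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'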

Typically a website will move to the next page by incrementing ?page=x or ?p=x at the end of the URL. You can then increment this value to scrape all the relevant pages.
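When the page number lives in the query string, rebuilding the URL is a little safer than a plain string replace. A small illustrative helper (the function name is mine, not from the answer), assuming a ?page=x style URL:

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url, param='page'):
    # Parse the URL, increment the page parameter, and rebuild it.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(param, ['1'])[0])
    query[param] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# e.g. next_page_url('https://www.domaintoscrape.com/?page=1')
#      -> 'https://www.domaintoscrape.com/?page=2'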

The overall pattern is as follows:

import re

import scrapy
from scrapy_splash import SplashRequest

from ..items import Item  # project item definition (not used in this sketch)

proxy = 'http://your.proxy.com:PORT'

current_page_xpath = '//div[your x path selector]/text()'
last_page_xpath = '//div[your other x path selector]/text()'


class MySpider(scrapy.Spider):

    name = 'my_spider'
    allowed_domains = ['domain.com']
    start_urls = ['https://www.domaintoscrape.com/page=1']

    def start_requests(self):
        for url in self.start_urls:
            # SplashRequest sends the request through Splash so the
            # JavaScript is rendered before parse() sees the response;
            # a plain scrapy.Request would only return the empty SPA shell.
            yield SplashRequest(url=url, callback=self.parse,
                                args={'wait': 1}, meta={'proxy': proxy})

    @staticmethod
    def get_page_nbr(value):
        # You may need a more complex regex to get the page numbers;
        # most of the time they come in the form "page X of Y".
        match = re.search(r'\d+', value or '')
        return int(match[0]) if match else None

    def parse(self, response):
        # Get the last and current page numbers from the response.
        last_page = self.get_page_nbr(response.xpath(last_page_xpath).get())
        current_page = self.get_page_nbr(response.xpath(current_page_xpath).get())

        # ... do something with your response ...

        # If the current page is less than the last page, make another
        # request by incrementing the page number in the URL.
        if current_page and last_page and current_page < last_page:
            next_url = response.url.replace(f'page={current_page}',
                                            f'page={current_page + 1}')
            yield SplashRequest(url=next_url, callback=self.parse,
                                args={'wait': 1}, meta={'proxy': proxy})

        # Optional progress message.
        if current_page and current_page == last_page:
            self.logger.info(f'processed {last_page} pages for {response.url}')
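The question also notes that the pagination control is an anchor with no href, in which case there may be no page=x URL to increment at all. The click can instead be driven inside Splash through its execute endpoint. A hedged sketch using Splash's documented select/mouse_click scripting API; the CSS selector a.next-page is a placeholder for whatever the element inspector actually shows:

from scrapy_splash import SplashRequest

# Lua script executed inside Splash: load the page, click the "next"
# anchor, wait for the SPA to re-render, and return the new HTML.
CLICK_NEXT_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    local next_link = splash:select('a.next-page')
    if next_link then
        next_link:mouse_click()
        assert(splash:wait(1))
    end
    return {html = splash:html()}
end
"""

def click_next_request(url, callback):
    # Use the 'execute' endpoint so the Lua script above runs instead
    # of the default render.html endpoint; SplashRequest passes the url
    # into the script's args automatically.
    return SplashRequest(url=url, callback=callback, endpoint='execute',
                         args={'lua_source': CLICK_NEXT_LUA})

The response delivered to the callback is the post-click HTML, so parse can treat it like any other rendered page.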

Finally, it's worth having a look on YouTube, as there are a number of tutorials on scrapy_splash and pagination.
