How to navigate through js/ajax(href="#") based pagination with Scrapy?


Problem Description

I want to iterate through all the category URLs and scrape the content from each page. With urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]] in this code I have only fetched the first category URL, but my goal is to fetch all of the URLs and the content inside each of them.
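
For reference, dropping the [0] index would collect every category link rather than only the first one; a minimal sketch assuming the same XPath and page structure as the spider below:

# all category hrefs instead of only the first one
urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()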

I'm using the scrapy_selenium library. The Selenium page source is not being passed to the scrape_it function. Please review my code and let me know if there is anything wrong with it. I'm new to the Scrapy framework.

Below is my spider code -

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from ..items import CouponcollectItem

class Couponsite6SpiderSpider(scrapy.Spider):
    name = 'couponSite6_spider'
    allowed_domains = ['www.couponcodesme.com']
    start_urls = ['https://www.couponcodesme.com/ae/categories']
    
    def parse(self, response):   
        urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]
        for url in urls:
            yield SeleniumRequest(
                url=response.urljoin(url),
                wait_time=3,
                callback=self.parse_urls
            ) 

    def parse_urls(self, response):
        driver = response.meta['driver']
        while True:
            next_page = driver.find_element_by_xpath('//a[@class="category_pagination_btn next_btn bottom_page_btn"]')
            try:
                html = driver.page_source
                response_obj = Selector(text=html)
                self.scrape_it(response_obj)
                next_page.click()
            except:
                break
        driver.close()

    def scrape_it(self, response):
        items = CouponcollectItem()
        print('Hi there')
        items['store_img_src'] = response.css('#temp1 > div > div.voucher_col_left.flexbox.spaceBetween > div.vouchercont.offerImg.flexbox.column1 > div.column.column1 > div > div > a > img::attr(src)').extract()
        yield items  

I have added the following code inside the settings.py file -

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

I'm attaching a terminal_output screenshot. Thank you for your time! Please help me solve this.

Recommended Answer

The problem is that you can't share the driver among asynchronously running threads, and you also can't run more than one in parallel. You can take the yield out and it will process the categories one at a time:

At the top:

from selenium import webdriver
import time

driver = webdriver.Chrome()

Then in your class:

def parse(self, response):
  urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
  for url in urls:
    self.do_category(url)

def do_page(self):
  time.sleep(1)
  html = driver.page_source
  response_obj = Selector(text=html)
  self.scrape_it(response_obj)

def do_category(self, url):
  driver.get(url)
  self.do_page()
  next_links = driver.find_elements_by_css_selector('a.next_btn')
  while len(next_links) > 0:
    next_links[0].click()
    self.do_page()
    next_links = driver.find_elements_by_css_selector('a.next_btn')
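
One caveat with the sketch above: scrape_it is a generator (it contains yield), so calling self.scrape_it(response_obj) on its own never executes its body or hands the items to Scrapy's pipelines. A minimal, hedged variation, assuming the same module-level driver and the spider's existing scrape_it, that collects the items and yields them from parse:

def parse(self, response):
    # yield every item collected while paging through each category
    urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
    for url in urls:
        for item in self.do_category(response.urljoin(url)):
            yield item

def do_page(self):
    time.sleep(1)
    html = driver.page_source
    response_obj = Selector(text=html)
    # scrape_it is a generator, so iterate it to actually run it
    return list(self.scrape_it(response_obj))

def do_category(self, url):
    items = []
    driver.get(url)
    items.extend(self.do_page())
    next_links = driver.find_elements_by_css_selector('a.next_btn')
    while len(next_links) > 0:
        next_links[0].click()
        items.extend(self.do_page())
        next_links = driver.find_elements_by_css_selector('a.next_btn')
    return items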

And if that's too slow for you, I recommend switching to Puppeteer.
