Scrapy and Selenium seem to interfere with each other


Problem description

Hi, I don't have much experience with web scraping or with using Scrapy and Selenium, so apologies in advance if there are too many bad practices in my code.

Brief background for my code: I am trying to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" and "No thanks" buttons on the page. Since the site has hrefs for different categories, I also need to request those "sublinks" to make sure I don't miss any items that are not shown on the root page.

The problem is that inside the loop for l in product_links:, Scrapy and Selenium seem to behave strangely. For example, I would expect response.url == self.driver.current_url to always be true, yet the two become different in the middle of this loop. Furthermore, self.driver seems to capture elements that do not exist at the current URL in products = self.driver.find_elements_by_xpath('//div[@data-url]'), and then fails to retrieve them again in sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2').

Many thanks. I'm really confused.

from webScrape.items import ProductItem
from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = 'name'
    domain = 'https://uk.burberry.com'

    def __init__(self):
        super().__init__()
        self.driver = webdriver.Chrome('path to driver')
        self.start_urls = [self.domain + '/' + k for k in ('womens-clothing', 'womens-bags', 'womens-scarves',
                                        'womens-accessories', 'womens-shoes', 'make-up', 'womens-fragrances')]
        self.pool = set()

    def parse(self, response):
        sub_links = response.xpath('//h2[starts-with(@class, "shelf1-section-title")]/a/@href').extract()
        if len(sub_links) > 0:
            for l in sub_links:
                yield Request(self.domain + l, callback = self.parse)
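        # Load the same page in the Selenium driver so the dynamic buttons can be clicked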
        self.driver.get(response.url)
        email_reg = self.driver.find_element_by_xpath('//button[@class="dc-reset dc-actions-btn js-data-capture-newsletter-block-cancel"]')
        if email_reg.is_displayed():
            email_reg.click()
        # Make sure to click all the "load more" buttons
        load_more_buttons = self.driver.find_elements_by_xpath('//div[@class="load-assets-button js-load-assets-button ga-shelf-load-assets-button"]')
        for button in load_more_buttons:
            if button.is_displayed():
                button.click()
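        # Every product tile exposes its detail-page URL in a data-url attribute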
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        product_links = [item.get_attribute('data-url') for item in products if item.get_attribute('data-url').split('-')[-1][1:] not in self.pool]
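        # Look up the shelf heading of each product to use as its sub-category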
        for l in product_links:
            sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
            if len(sub) > 0:
                sub_category = ', '.join(set([s.get_attribute('data-ga-shelf-title') for s in sub]))
            else:
                sub_category = ''
            yield Request(self.domain + l, callback = self.parse_product, meta = {'sub_category': sub_category})

    def parse_product(self, response):
        item = ProductItem()
        item['id'] = response.url.split('-')[-1][1:]
        item['sub_category'] = response.meta['sub_category']
        item['name'] = response.xpath('//h1[@class="product-title transaction-title ta-transaction-title"]/text()').extract()[0].strip()
        self.pool.add(item['id'])
        yield item
        others = response.xpath('//input[@data-url]/@data-url').extract()
        for l in others:
            if l.split('-')[-1][1:] not in self.pool:
                yield Request(self.domain + l, callback = self.parse_product, meta = response.meta)

Answer

Scrapy is an asynchronous framework. The code in your parse*() methods does not always run linearly: wherever there is a yield, execution of that method may pause for a while as other parts of the code run.

Because there is a yield inside the loop, that explains the unexpected behaviour: at the yield, some other part of your program resumes execution and may navigate the Selenium driver to a different URL, so when your loop resumes, the driver's URL has changed.
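If you do keep Selenium, one way to avoid the interleaving is to finish every read from the driver before the first yield, so nothing touches self.driver after another callback may have navigated it elsewhere. Below is a minimal sketch of how the end of your parse() method could be reordered (same XPaths and attribute names as your code); it illustrates the idea rather than being a tested drop-in fix.

    def parse(self, response):
        # ... same sub-link requests, driver.get(), pop-up dismissal and
        # "load more" clicks as in the original parse() ...

        # Read everything needed from the driver first, while it is still on
        # response.url; nothing below touches self.driver after the first yield.
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        rows = []
        for item in products:
            url = item.get_attribute('data-url')
            if url.split('-')[-1][1:] in self.pool:
                continue
            subs = self.driver.find_elements_by_xpath(
                '//div[(@class="shelf-container") and (.//div/@data-url="' + url + '")]//h2')
            sub_category = ', '.join({s.get_attribute('data-ga-shelf-title') for s in subs})
            rows.append((url, sub_category))

        # Only now hand control back to Scrapy; the driver is no longer needed.
        for url, sub_category in rows:
            yield Request(self.domain + url, callback=self.parse_product,
                          meta={'sub_category': sub_category})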

To be honest, as far as I can tell you don't really need Selenium in Scrapy for your use case. In Scrapy, tools like Splash or Selenium are only used in very specific scenarios, for things like avoiding bot detection.

It is usually a better approach to figure out the structure of the page HTML and the parameters used in the requests with your browser's developer tools (Inspect, Network) and then reproduce those requests in Scrapy.
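As a rough illustration of that approach, here is what a Selenium-free spider could look like. The /load-more path, its page parameter and the item fields are placeholders I made up, not the site's real API; replace them with the request that the "view more" button actually fires, as seen in the Network tab.

from scrapy import Spider, Request


class ProductSpider(Spider):
    # Sketch only: the "/load-more" path, its "page" parameter and the item
    # fields below are placeholders, not the site's real API. Use the
    # browser's Network tab to find the request that the "view more" button
    # actually fires and copy its URL and parameters here.
    name = 'products_no_selenium'
    domain = 'https://uk.burberry.com'
    start_urls = [domain + '/womens-clothing']

    def parse(self, response):
        # Product tiles already present in the server-rendered HTML.
        urls = response.xpath('//div[@data-url]/@data-url').extract()
        for url in urls:
            yield Request(self.domain + url, callback=self.parse_product)
        # Paginate through the hypothetical "load more" endpoint until a page
        # comes back with no products.
        if urls:
            page = response.meta.get('page', 1) + 1
            yield Request(self.domain + '/load-more?page=' + str(page),
                          callback=self.parse, meta={'page': page})

    def parse_product(self, response):
        # Minimal example item; extend with the fields you actually need.
        yield {
            'id': response.url.split('-')[-1][1:],
            'name': response.xpath('//h1/text()').get(default='').strip(),
        }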

