Scrapy and Selenium seem to interfere with each other


Problem description

Hi, I don't have much experience with web scraping or with using Scrapy and Selenium, so apologies in advance if there are too many bad practices in my code.

Brief background for my code: I am trying to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" and "No thanks" buttons on the page. Since the site has hrefs for different categories, I also need to request those "sublinks" to make sure I don't miss any items that are not shown on the root page.

The problem is that inside the loop for l in product_links:, Scrapy and Selenium seem to behave strangely. For example, I would expect response.url == self.driver.current_url to always be true, yet the two become different in the middle of this loop. Furthermore, self.driver seems to capture elements that do not exist at the current URL in products = self.driver.find_elements_by_xpath('//div[@data-url]'), and then fails to retrieve them again in sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2').

Many thanks. I'm really confused.

from webScrape.items import ProductItem
from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = 'name'
    domain = 'https://uk.burberry.com'

    def __init__(self):
        super().__init__()
        self.driver = webdriver.Chrome('path to driver')
        self.start_urls = [self.domain + '/' + k for k in ('womens-clothing', 'womens-bags', 'womens-scarves',
                                        'womens-accessories', 'womens-shoes', 'make-up', 'womens-fragrances')]
        self.pool = set()

    def parse(self, response):
        sub_links = response.xpath('//h2[starts-with(@class, "shelf1-section-title")]/a/@href').extract()
        if len(sub_links) > 0:
            for l in sub_links:
                yield Request(self.domain + l, callback = self.parse)
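        # Load the same page in the Selenium driver so the dynamic buttons can be clicked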
        self.driver.get(response.url)
        email_reg = self.driver.find_element_by_xpath('//button[@class="dc-reset dc-actions-btn js-data-capture-newsletter-block-cancel"]')
        if email_reg.is_displayed():
            email_reg.click()
        # Make sure to click all the "load more" buttons
        load_more_buttons = self.driver.find_elements_by_xpath('//div[@class="load-assets-button js-load-assets-button ga-shelf-load-assets-button"]')
        for button in load_more_buttons:
            if button.is_displayed():
                button.click()
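        # Every product tile exposes its detail-page URL in a data-url attribute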
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        product_links = [item.get_attribute('data-url') for item in products if item.get_attribute('data-url').split('-')[-1][1:] not in self.pool]
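        # Look up the shelf heading of each product to use as its sub-category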
        for l in product_links:
            sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
            if len(sub) > 0:
                sub_category = ', '.join(set([s.get_attribute('data-ga-shelf-title') for s in sub]))
            else:
                sub_category = ''
            yield Request(self.domain + l, callback = self.parse_product, meta = {'sub_category': sub_category})

    def parse_product(self, response):
        item = ProductItem()
        item['id'] = response.url.split('-')[-1][1:]
        item['sub_category'] = response.meta['sub_category']
        item['name'] = response.xpath('//h1[@class="product-title transaction-title ta-transaction-title"]/text()').extract()[0].strip()
        self.pool.add(item['id'])
        yield item
        others = response.xpath('//input[@data-url]/@data-url').extract()
        for l in others:
            if l.split('-')[-1][1:] not in self.pool:
                yield Request(self.domain + l, callback = self.parse_product, meta = response.meta)

Answer

Scrapy is an asynchronous framework. The code in your parse*() methods does not always run linearly: wherever there is a yield, execution of that method may pause for a while as other parts of the code run.

Because there is a yield inside the loop, that explains the unexpected behaviour: at the yield, some other part of your program resumes execution and may navigate the Selenium driver to a different URL, so when your loop resumes, the driver's URL has changed.
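If you do keep Selenium, one way to avoid the interleaving is to finish every read from the driver before the first yield, so nothing touches self.driver after another callback may have navigated it elsewhere. Below is a minimal sketch of how the end of your parse() method could be reordered (same XPaths and attribute names as your code); it illustrates the idea rather than being a tested drop-in fix.

    def parse(self, response):
        # ... same sub-link requests, driver.get(), pop-up dismissal and
        # "load more" clicks as in the original parse() ...

        # Read everything needed from the driver first, while it is still on
        # response.url; nothing below touches self.driver after the first yield.
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        rows = []
        for item in products:
            url = item.get_attribute('data-url')
            if url.split('-')[-1][1:] in self.pool:
                continue
            subs = self.driver.find_elements_by_xpath(
                '//div[(@class="shelf-container") and (.//div/@data-url="' + url + '")]//h2')
            sub_category = ', '.join({s.get_attribute('data-ga-shelf-title') for s in subs})
            rows.append((url, sub_category))

        # Only now hand control back to Scrapy; the driver is no longer needed.
        for url, sub_category in rows:
            yield Request(self.domain + url, callback=self.parse_product,
                          meta={'sub_category': sub_category})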

To be honest, as far as I can tell you don't really need Selenium in Scrapy for your use case. In Scrapy, tools like Splash or Selenium are only used in very specific scenarios, for things like avoiding bot detection.

It is usually a better approach to figure out the structure of the page HTML and the parameters used in the requests with your browser's developer tools (Inspect, Network) and then reproduce those requests in Scrapy.
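As a rough illustration of that approach, here is what a Selenium-free spider could look like. The /load-more path, its page parameter and the item fields are placeholders I made up, not the site's real API; replace them with the request that the "view more" button actually fires, as seen in the Network tab.

from scrapy import Spider, Request


class ProductSpider(Spider):
    # Sketch only: the "/load-more" path, its "page" parameter and the item
    # fields below are placeholders, not the site's real API. Use the
    # browser's Network tab to find the request that the "view more" button
    # actually fires and copy its URL and parameters here.
    name = 'products_no_selenium'
    domain = 'https://uk.burberry.com'
    start_urls = [domain + '/womens-clothing']

    def parse(self, response):
        # Product tiles already present in the server-rendered HTML.
        urls = response.xpath('//div[@data-url]/@data-url').extract()
        for url in urls:
            yield Request(self.domain + url, callback=self.parse_product)
        # Paginate through the hypothetical "load more" endpoint until a page
        # comes back with no products.
        if urls:
            page = response.meta.get('page', 1) + 1
            yield Request(self.domain + '/load-more?page=' + str(page),
                          callback=self.parse, meta={'page': page})

    def parse_product(self, response):
        # Minimal example item; extend with the fields you actually need.
        yield {
            'id': response.url.split('-')[-1][1:],
            'name': response.xpath('//h1/text()').get(default='').strip(),
        }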

