Scrapy + selenium requests twice for each url


Problem description

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy has already downloaded response.url; this fetches it a second time.
        self.driver.get(response.url)

        while True:
            try:
                next_page = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_page.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # No "next" link on the last page, so stop paginating.
                break

        self.driver.close()

selenium with scrapy for dynamic page

This solution works well, but it requests each URL twice: once by the Scrapy scheduler and again by the Selenium web driver.

It therefore takes twice as long to finish the job as a Scrapy request without Selenium. How can this be avoided?

Recommended answer

Here is a trick that can be used to solve this problem.

Create a web service for Selenium and run it locally:

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)


class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # Lazily create a single shared headless Chrome instance.
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        # Render the requested page in the browser and return its HTML.
        self.driver.get(url)

        return make_response(self.driver.page_source)


api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)

Now http://127.0.0.1:5000/?url=https://stackoverflow.com/users/5939254/yash-pokar will return the rendered web page, fetched through the Selenium Chrome/Firefox driver.
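As a quick sanity check, you can hit the service directly before wiring it into the spider. This is a minimal sketch, not part of the original answer; it assumes the Flask service above is already running on port 5000 and uses the third-party requests library:

import requests

# requests URL-encodes the target for us via the params argument.
target = 'https://stackoverflow.com/users/5939254/yash-pokar'
resp = requests.get('http://127.0.0.1:5000/', params={'url': target})
print(resp.status_code, len(resp.text))  # expect 200 and non-trivial HTML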

Now here is how our spider will look:

import scrapy
from urllib.parse import quote


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # Route every request through the local Selenium service, so each
            # page is downloaded (and rendered) exactly once.
            url = 'http://127.0.0.1:5000/?url={}'.format(quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            'field': response.xpath('//td[@class="pagn-next"]/a').get(),
        }
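One caveat: inside parse(), response.url now points at the local service rather than at eBay. If the callback needs the real URL (for pagination, for instance), one option is to carry it in the request meta. The sketch below is not part of the original answer; the 'original_url' meta key is purely illustrative:

# Inside the ProductSpider class above (sketch only):
def start_requests(self):
    for url in self.urls:
        proxied = 'http://127.0.0.1:5000/?url={}'.format(quote(url))
        # Keep the original eBay URL available to the callback.
        yield scrapy.Request(proxied, meta={'original_url': url})

def parse(self, response):
    # response.url is the local service; the real page URL travels in meta.
    yield {
        'page': response.meta['original_url'],
        'field': response.xpath('//td[@class="pagn-next"]/a').get(),
    }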

