How to make Selenium scripts work faster?


Problem description

I use Python with Selenium and Scrapy to crawl a website, but my script is very slow:

Crawled 1 pages (at 1 pages/min)

I use CSS selectors instead of XPath to optimise the time, and I changed the middlewares:

'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium just too slow, or should I change something in the settings?

My code:

import time

from scrapy import Request
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def start_requests(self):
    yield Request(self.start_urls, callback=self.parse)
def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    #INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1,len(catNums)+1):
        print "\n IN catnumber \n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

        if(pages):
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

            checkText = (page.text).strip()
            if(len(checkText) > 0):
                pageNums = int(page.text)
                pageNums = pageNums  - 1
                for pageNumbers in range (pageNums):
                    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "waitingOverlay")))
                    driver.find_element_by_css_selector('.jsNxtPage.pgNext').click()
                    self.parse_articles(driver)
                    time.sleep(5)

def parse_articles(self,driver) :
    test = driver.find_elements_by_css_selector('html body div#page div#main.content div#sContener div#sContent div#lpContent.jsTab ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat')

def between(self, value, a, b):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

Answer

So your code has a few flaws here:

  1. You use selenium to parse page content, when scrapy Selectors are faster and more efficient (see the sketch after this list).
  2. You start a webdriver for every response.
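
For the first point, a minimal sketch (my own illustration, not the answerer's code) of what parsing with scrapy Selectors instead of the driver could look like. The CSS path is trimmed from the one in the question, and parse_articles receiving a scrapy response is an assumption:

# hypothetical sketch: parse the category labels with scrapy selectors instead of
# driver.find_elements_*, so parsing happens in scrapy's fast asynchronous code path
def parse_articles(self, response):
    for cat in response.css('div.prdtBloc p.prdtBDesc strong.prdtBCat::text').extract():
        yield {'category': cat.strip()}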

This can be resolved very elegantly by using scrapy's downloader middlewares! You want to create a custom downloader middleware that downloads requests using selenium rather than the scrapy downloader.

For example, I use this:

# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloader(object):
    def create_driver(self):
        """only start the driver if middleware is ever called"""
        if not getattr(self, 'driver', None):
            self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # this is called for every request, but we don't want to render
        # every request in selenium, so use meta key for those we do want.
        if not request.meta.get('selenium', False):
            return None  # let scrapy's normal downloader handle this request
        self.create_driver()
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source, encoding='utf-8')
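
One thing the answer above does not cover (my addition, an optional extension rather than part of the original answer): the browser started by the middleware is never shut down. A minimal sketch of quitting it when the spider closes, using scrapy's spider_closed signal, added to the same SeleniumDownloader class:

# middlewares.py (continued) -- optional extension, not part of the original answer
from scrapy import signals

class SeleniumDownloader(object):
    # ... create_driver() and process_request() as above ...

    @classmethod
    def from_crawler(cls, crawler):
        # connect the middleware to the spider_closed signal
        mw = cls()
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_closed(self, spider):
        # quit the browser once the crawl is finished
        if getattr(self, 'driver', None):
            self.driver.quit()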

Activate your middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloader': 13,
}

Then in your spider you can specify which urls to download via the selenium driver by adding a meta argument.

import scrapy

# you can start with selenium
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'selenium': True})

def parse(self, response):
    # this response is rendered by selenium!
    # also can use no selenium for another response if you wish
    # pick a single href and make it absolute before requesting it
    url = response.xpath("//a/@href").extract_first()
    yield scrapy.Request(response.urljoin(url))

The advantage of this approach is that your driver is started only once and is used only to download the page source; the rest is left to scrapy's proper asynchronous tools.
The disadvantage is that you cannot click buttons and the like, since you are not exposed to the driver. Most of the time you can reverse engineer what the buttons do via the network inspector, and you should never need to do any clicking with the driver itself.
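
As an example of that reverse engineering (my sketch, not verified against the real site): in the question the driver is only used to type "75018" into the OneLineCustomerAddress field and submit the form. If the network inspector shows this is an ordinary form submission, the same step can be done with a plain scrapy FormRequest and no clicking at all; the form action URL below is an assumption:

# hypothetical sketch: replace the selenium form submission with a scrapy FormRequest,
# assuming the network inspector shows a plain HTML form post
import scrapy

def start_requests(self):
    yield scrapy.FormRequest(
        "http://www.example.com",                      # form action seen in the inspector (assumption)
        formdata={"OneLineCustomerAddress": "75018"},  # field name taken from the question
        callback=self.parse,
    )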
