How to make a Selenium script faster?


Problem description

I use Python Selenium and Scrapy for crawling a website.

But my script is too slow:

Crawled 1 pages (at 1 pages/min)

I use CSS selectors instead of XPath to optimise the time, and I changed the middlewares:

'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium too slow, or should I change something in the settings?

My code:

import time

from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

def start_requests(self):
    yield Request(self.start_urls, callback=self.parse)

def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    #INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1,len(catNums)+1):
        print "\n IN catnumber \n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

        if(pages):
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

            checkText = (page.text).strip()
            if(len(checkText) > 0):
                pageNums = int(page.text)
                pageNums = pageNums  - 1
                for pageNumbers in range (pageNums):
                    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "waitingOverlay")))
                    driver.find_element_by_css_selector('.jsNxtPage.pgNext').click()
                    self.parse_articles(driver)
                    time.sleep(5)

def parse_articles(self,driver) :
    test = driver.find_elements_by_css_selector('html body div#page div#main.content div#sContener div#sContent div#lpContent.jsTab ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat')

def between(self, value, a, b):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

Recommended answer

So your code has a few flaws here:

  1. You use Selenium to parse the page content, when Scrapy Selectors are faster and more efficient (a sketch follows this list).
  2. You start a webdriver for every response.
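
To illustrate the first point, here is a minimal sketch (not the asker's code), assuming the rendered HTML is available as a Scrapy response; the CSS path is the meaningful tail of the one used in parse_articles above:

def parse_articles(self, response):
    # Scrapy selectors parse the already-downloaded HTML,
    # with no browser round-trip per element lookup
    for cat in response.css('ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat::text'):
        yield {'category': cat.get().strip()}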

This can be resolved very elegantly by using Scrapy's downloader middlewares! You want to create a custom downloader middleware that downloads requests using Selenium rather than the Scrapy downloader.

For example, I use this:

# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumDownloader(object):
    def create_driver(self):
        """only start the driver if middleware is ever called"""
        if not getattr(self, 'driver', None):
            self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # this is called for every request, but we don't want to render
        # every request in selenium, so use meta key for those we do want.
        if not request.meta.get('selenium', False):
            # returning None hands the request on to the regular downloader;
            # returning the request itself would just reschedule it in a loop
            return None
        self.create_driver()
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source, encoding='utf-8')
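
The sketch above never quits the browser. A possible addition, assuming the same SeleniumDownloader class: connect Scrapy's spider_closed signal and shut the driver down there:

# middlewares.py (continued) -- driver cleanup via Scrapy's signals API
from scrapy import signals

class SeleniumDownloader(object):
    # ... create_driver() and process_request() as above ...

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        # quit the browser only if it was ever started
        if getattr(self, 'driver', None):
            self.driver.quit()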

Activate your middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloader': 13,
}
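
A note on the number: Scrapy invokes each middleware's process_request in increasing order of this value, so a low value such as 13 runs the Selenium middleware before the built-in downloader middlewares (the first of which sits at order 100).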

Then in your spider you can specify which URLs to download via the Selenium driver by adding a meta argument.

# you can start with selenium
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'selenium': True})

def parse(self, response):
    # this response is rendered by selenium!
    # you can also skip selenium for another response if you wish
    url = response.urljoin(response.xpath("//a/@href").get())
    yield scrapy.Request(url)
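
Requests yielded without meta={'selenium': True} fall through process_request untouched and are fetched by Scrapy's normal asynchronous downloader, so the two download paths mix freely within one spider.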

The advantage of this approach is that your driver is started only once and used only to download the page source; the rest is left to proper asynchronous Scrapy tools.
The disadvantage is that you cannot click buttons and the like, since you are not exposed to the driver. Most of the time you can reverse engineer what the buttons do via the network inspector, and you should never need to do any clicking with the driver itself.
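
For instance, if the network inspector shows that the "next page" button simply requests a paginated URL, the clicks and sleeps in the original script collapse into plain requests. The URL pattern below is purely hypothetical; substitute whatever the inspector reveals:

# a hedged sketch: replacing Selenium clicks with direct requests;
# the query-string pattern is hypothetical -- copy the real one from
# the browser's network inspector
def parse(self, response):
    # same last-page link the original script located via XPath
    last_page = int(response.css('.pg ul li:last-child a::text').get('1'))
    for page in range(2, last_page + 1):
        yield scrapy.Request(
            'http://www.example.com/search?postcode=75018&page=%d' % page,
            callback=self.parse_articles,
        )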

