Analyzing a large collection of links quickly with Selenium WebDriver


Problem description


I have a web page with an extremely large number of links (around 300), and I would like to collect information on these links.

Here is my code:

beginning_time = Time.now
#This gets a collection of links from the webpage
tmp = driver.find_elements(:xpath,"//a[string()]")
end_time = Time.now
puts "Execute links:#{(end_time - beginning_time)*1000} milliseconds for #{tmp.length} links"


before_loop = Time.now
#Here I iterate through the links
tmp.each do |link|
    #I am not interested in the links I can't see
    if(link.location.x < windowX and link.location.y < windowY)
        #I then insert the links into a NoSQL database, 
        #but for all purposes you could imagine this as just saving the data in a hash table.
        $elements.insert({
            "text" => link.text,
            "href" => link.attribute("href"),
            "type" => "text",
            "x" => link.location.x,
            "y" => link.location.y,
            "url" => url,
            "accessTime" => accessTime,
            "browserId" => browserId
        })
    end
end
after_loop = Time.now
puts "The loop took #{(after_loop - before_loop)*1000} milliseconds"

It currently takes 20ms to get the link collection and around 4000ms (or 4 seconds) to retrieve the information for the links. When I separate the accessors from the NoSQL insert, I find that the NoSQL insert only takes 20ms and that the majority of the time is spent in the accessors (which became much slower after being separated from the NoSQL insert, for reasons I don't understand), which leads me to conclude that the accessors must be executing JavaScript.
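Each of these accessors (text, attribute, location) is a separate WebDriver command, i.e. a client-browser round trip, and the loop above fetches link.location four times per link: twice in the condition and twice in the hash. A minimal sketch for isolating the accessor cost, assuming the tmp collection from the code above:

require 'benchmark'

# Time each accessor in isolation, reusing the `tmp` collection from above.
# Every call below is its own WebDriver round trip, so the totals grow
# linearly with the number of links.
text_ms     = Benchmark.realtime { tmp.each { |link| link.text } } * 1000
location_ms = Benchmark.realtime { tmp.each { |link| link.location } } * 1000

puts "text accessors: #{text_ms.round} ms"
puts "location accessors: #{location_ms.round} ms"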

My question is: How do I collect these links and their information more quickly?

The first solution that came to mind was to run two drivers in parallel, but WebDrivers are not thread-safe, which means I would have to create a new instance of the WebDriver and navigate to the page. That raises the question of how to download the page source so that it can be loaded into another driver; this cannot be done in Selenium itself, so it would have to be performed on Chrome with desktop automation tools, which adds a considerable amount of overhead.
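For what it's worth, Selenium can at least dump the rendered DOM via page_source and feed it to a second driver as a local file; a hedged sketch follows. Scripts, relative URLs, and dynamic state are not preserved in the snapshot, so element positions in the copy may not match the live page:

require 'selenium-webdriver'
require 'tempfile'

# Serialize the DOM as currently rendered and load the snapshot into a
# second, independent WebDriver instance via a file:// URL.
snapshot = Tempfile.new(['page', '.html'])
snapshot.write(driver.page_source)
snapshot.close

driver2 = Selenium::WebDriver.for :chrome
driver2.navigate.to "file://#{snapshot.path}"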

Another alternative I have heard of is to stop using ChromeDriver and just use PhantomJS, but I need to display the page in a visible browser.

Is there any other alternative that I haven't considered yet?

Solution

You seem to be using WebDriver purely to execute JavaScript rather than to access the objects.

A couple of ideas to try if you drop the JavaScript (excuse the Java, but you get the idea):

// We restrict via XPath, so we get fewer links back AND do not have to
// check the text within the loop
List<WebElement> linksWithText = driver.findElements(By.xpath("//a[text() and not(text()='')]"));

for (WebElement link : linksWithText) {
    // Store the location details rather than re-getting them each time
    Point location = link.getLocation();
    Integer x = location.getX();
    Integer y = location.getY();

    if (x < windowX && y < windowY) {
        // Insert all info using WebDriver commands
    }
}
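The same idea translated back into the question's Ruby, as a sketch (windowX and windowY are the visibility bounds from the question):

# Restrict via XPath so fewer elements come back, and fetch the location
# once per link instead of four times as in the original loop.
links_with_text = driver.find_elements(:xpath, "//a[text() and not(text()='')]")

links_with_text.each do |link|
  location = link.location  # a single WebDriver call, reused below
  next unless location.x < windowX && location.y < windowY
  # insert all info using WebDriver commands
end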

I normally use remote grids, so performance is a key concern in my tests, which is why I always try to restrict via CSS selectors or XPath rather than getting everything back and looping.
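A further option in the same spirit, going beyond the answer above: since every per-element accessor is its own round trip, the entire extraction can be pushed into a single execute_script call, one round trip for all ~300 links. This is a sketch, assuming the windowX/windowY bounds from the question; note that getBoundingClientRect returns viewport-relative coordinates, which can differ from WebDriver's location once the page is scrolled:

# One script call collects text, href, and position for every visible
# link; the browser does the looping instead of the WebDriver client.
links = driver.execute_script(<<~JS, windowX, windowY)
  var maxX = arguments[0], maxY = arguments[1];
  var out = [];
  var anchors = document.querySelectorAll('a');
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    var r = a.getBoundingClientRect();
    if (a.textContent.trim() && r.left < maxX && r.top < maxY) {
      out.push({ text: a.textContent, href: a.href,
                 x: Math.round(r.left), y: Math.round(r.top) });
    }
  }
  return out;
JS

# `links` is now an Array of Hashes such as
# { "text" => "...", "href" => "...", "x" => 10, "y" => 20 },
# ready for the NoSQL insert.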

