Reliably detect page load or time out, Selenium 2

Question

I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.

I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:

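from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):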
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try: links.append(elt.get_attribute("href"))
            except WebDriverException: pass
    except WebDriverException: pass
    return links

This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning) but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.

What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?

Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).

NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.

Recommended answer

I was in a similar situation. I wrote a screenshot system using Selenium for a fairly well-known website service and had the same predicament: I could not know anything about the page being loaded.

After speaking with some of the Selenium developers, the answer was that various WebDriver implementations (Firefox Driver versus IEDriver for example) make different choices about when a page is considered to be loaded or not for the WebDriver to return control.

If you dig deep into the Selenium code, you can find the spots that try to make the best choices. But a number of things can prevent the state being looked for from ever being reached, such as multiple frames where one does not complete in a timely manner, so there are cases where the driver simply does not return.

I was told, "it's an open-source project", and that it probably won't/can't be corrected for every possible scenario, but that I could make fixes and submit patches where applicable.

In the long run, that was a bit much for me to take on, so, like you, I created my own timeout process. Since I use Java, I created a new Thread that, upon reaching the timeout, tries several things to get WebDriver to return; at times, just pressing certain keys to get the browser to respond has worked. If it still does not return, I kill the browser and try again.
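
The watchdog idea described above might look roughly like this in Python, the language of the question. This is only a sketch under assumptions: `run_with_deadline`, `task`, and `on_timeout` are hypothetical names, not part of Selenium, and it assumes the hang happens inside a blocking driver call, so the deadline has to be enforced from a second thread.

```python
import threading

def run_with_deadline(task, timeout, on_timeout):
    """Run task() in a worker thread. If it has not finished after `timeout`
    seconds, call on_timeout() (for example, something that kills the browser)
    and report failure. Returns (ok, value)."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = task()
        except Exception as exc:  # a killed browser usually surfaces as an error here
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # a stuck call must not keep the interpreter alive
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        on_timeout()    # hypothetical recovery hook supplied by the caller
        thread.join(5)  # give the blocked call a moment to unwind
        return False, None
    if "error" in outcome:
        return False, None
    return True, outcome["value"]
```

With the question's code, a call could look like `ok, links = run_with_deadline(lambda: load_page(driver, url), 60, driver.quit)`. If the driver is badly wedged, `driver.quit` may block as well, in which case killing the browser process directly is the stronger fallback.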

Starting the driver again has handled most cases for us, as if the second load of the browser allowed it to be in a more settled state (mind you we are launching from VMs and the browser constantly wants to check for updates and run certain routines when it hasn't been launched recently).

Another piece of this is that we launch a known URL first and confirm some aspects of the browser, and that we are in fact able to interact with it, before continuing. With these steps together the failure rate is pretty low: about 3% over thousands of tests on all browsers, versions, and OSs (FF, IE, Chrome, Safari, Opera, iOS, Android, etc.).
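
The restart-and-check routine can be sketched as follows. Everything here is an assumption made for illustration: `load_with_fresh_driver`, `make_driver`, and `load` are hypothetical names, and the `1 + 1` script is just one possible way to confirm that the browser responds. The answer does not say which checks were actually used.

```python
def load_with_fresh_driver(make_driver, load, url, known_url, attempts=2):
    """Start a new browser per attempt, confirm it responds on a known URL,
    then run load(driver, url). Returns load's result, or None if every
    attempt failed."""
    for _ in range(attempts):
        driver = None
        try:
            driver = make_driver()  # e.g. selenium.webdriver.Firefox
            driver.get(known_url)   # warm-up page we control
            # Hypothetical sanity check that the browser can be scripted.
            if driver.execute_script("return 1 + 1") != 2:
                continue
            return load(driver, url)
        except Exception:
            # In real use, catch WebDriverException rather than Exception.
            continue
        finally:
            if driver is not None:
                try:
                    driver.quit()
                except Exception:
                    pass  # browser already gone; nothing left to clean up
    return None
```

Passing the driver factory in as an argument keeps the retry logic independent of the browser, which matters if the same routine has to run against several drivers.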

Last but not least, for your case, it sounds like you only really need to capture the links on the page, not have full browser automation. There are other approaches I might take toward that, namely cURL and Linux tools.
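
As a rough illustration of the browser-free route, here is a sketch that uses Python's standard library in place of cURL (a substitution made for this example, not the tooling the answer names). `LinkCollector`, `extract_links`, and `fetch_links` are hypothetical names. Note that this only sees links present in the served HTML, not ones created by JavaScript, and it ignores `<base href>`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

def fetch_links(url, timeout=30):
    # `timeout` bounds each blocking socket operation, not the whole download.
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
        return extract_links(html, response.geturl())
```

For example, `fetch_links("http://example.com/")` would return the absolute URLs of the anchors in that page's markup.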
