HtmlUnitDriver causes problems while getting a URL


Problem description

I have a page crawler developed in Java using the Selenium libraries. The crawler goes through a website that uses JavaScript to launch 3 applications, which are displayed as HTML in popup windows.

The crawler has no issues when launching 2 of the applications, but on the 3rd one the crawler freezes forever.

The code I'm using is similar to

public void applicationSelect() {
  ...
  //obtain url by parsing the tag's href attribute
  ...

  this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
  this.driver.setJavascriptEnabled(true);
  this.driver.get(url); //the code does not execute after this point for the 3rd app
  ...
}

I have also tried clicking on the web element through the following code

public void applicationSelect() {
  ...
  WebElement element = this.driver.findElementByLinkText("linkText");
  element.click(); //the code does not execute after this point for the 3rd app
  ...
}

Clicking on it produces exactly the same result. For the above code, I've made sure I am getting the right element.
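While diagnosing a hang like this, it can help to keep the crawler itself from blocking forever. The following is only a sketch using plain JDK concurrency utilities; the class name `TimeLimited` and the method `runWithTimeout` are hypothetical helpers, not part of Selenium or HtmlUnit. Note that cancelling only interrupts the worker thread, so code that ignores interruption may keep running in the background.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a blocking call with an upper time limit.
final class TimeLimited {
    private TimeLimited() { }

    // Returns true if the task finished within timeoutMillis, false otherwise.
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-call");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the crawler, the blocking call could be wrapped along the lines of `TimeLimited.runWithTimeout(() -> driver.get(url), 30000)`, where the 30-second limit is an arbitrary value chosen for illustration. A `false` result then marks the application that hangs without freezing the whole crawl.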

Can anyone tell me what could be the problem I'm having?

On the application side, I cannot disclose any information about the html code. I know this makes things harder for trying to solve the problem and for that I apologize in advance.

=== Update 2013-04-10 ===

So, I attached the library sources to my crawler and saw where this.driver.get(url) was getting stuck.

Basically, the driver gets lost in an infinite refresh loop. Within the WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded that keeps refreshing, seemingly without end.

Here is the code from WaitingRefreshHandler, which is contained in com.gargoylesoftware.htmlunit:

public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
  int seconds = requestedWait;
  if (seconds > maxwait_ && maxwait_ > 0) {
    seconds = maxwait_;
  }
  try {
    Thread.sleep(seconds * 1000);
  }
  catch (final InterruptedException e) {
    /* This can happen when the refresh is happening from a navigation that started
     * from a setTimeout or setInterval. The navigation will cause all threads to get
     * interrupted, including the current thread in this case. It should be safe to
     * ignore it since this is the thread now doing the navigation. Eventually we should
     * refactor to force all navigation to happen back on the main thread.
     */
    if (LOG.isDebugEnabled()) {
      LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
    }
  }
  final WebWindow window = page.getEnclosingWindow();
  if (window == null) {
    return;
  }
  final WebClient client = window.getWebClient();
  client.getPage(window, new WebRequest(url));
}

The instruction "client.getPage(window, new WebRequest(url))" calls WebClient once again to reload the page, which only ends up calling this very same refresh method again. This seems to go on indefinitely; memory does not fill up quickly only because of the "Thread.sleep(seconds * 1000)", which forces a 3m wait before trying again.
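The shape of this loop can be reproduced in miniature without HtmlUnit. The sketch below uses hypothetical stand-in types (`MiniClient`, `MiniRefreshHandler`, `RefreshLoopDemo`), not the library's classes, and assumes a page that always asks to be refreshed. A safety limit is added so the demonstration terminates, which the real code path does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HtmlUnit's RefreshHandler; not the library's interface.
interface MiniRefreshHandler {
    void handleRefresh(MiniClient client, String url);
}

// Hypothetical stand-in for WebClient: every page load hands control to the refresh handler.
final class MiniClient {
    final List<String> loads = new ArrayList<>();
    private final MiniRefreshHandler handler;
    private final int safetyLimit;

    MiniClient(MiniRefreshHandler handler, int safetyLimit) {
        this.handler = handler;
        this.safetyLimit = safetyLimit;
    }

    void getPage(String url) {
        if (loads.size() >= safetyLimit) {
            throw new IllegalStateException("refresh loop detected after " + loads.size() + " loads");
        }
        loads.add(url);
        // In this model every page carries a refresh directive, like the third application.
        handler.handleRefresh(this, url);
    }
}

final class RefreshLoopDemo {
    // Same shape as the handler quoted above: reload the page from inside the handler.
    static final MiniRefreshHandler RELOADING = (client, url) -> client.getPage(url);
    // A handler that simply ignores the refresh request.
    static final MiniRefreshHandler IGNORING = (client, url) -> { };

    private RefreshLoopDemo() { }

    // Loads one page and reports how many loads happened before things settled or the limit hit.
    static int countLoads(MiniRefreshHandler handler, int safetyLimit) {
        MiniClient client = new MiniClient(handler, safetyLimit);
        try {
            client.getPage("http://example.invalid/app3");
        } catch (IllegalStateException loopDetected) {
            // Only the artificial safety limit stops the reloading handler.
        }
        return client.loads.size();
    }
}
```

With the reloading handler the load count always runs up to whatever safety limit is passed in, while the ignoring handler stops after a single load.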

Does anyone have any suggestions on how I can work around this issue? One suggestion I got was to create two new classes that extend HtmlUnitDriver and WebClient, and then override the relevant methods in order to avoid this problem.

Thanks again.

Solution

I solved my eternal refresh problem by creating a do-nothing RefreshHandler class:

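import java.net.URL;
import com.gargoylesoftware.htmlunit.Page;

// Ignores every refresh request, so a self-refreshing page cannot loop forever.
public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
  public RefreshHandler() { }
  public void handleRefresh(final Page page, final URL url, final int seconds) { }
}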

In addition, I extended the HtmlUnitDriver class and overrode the method modifyWebClient to set the new RefreshHandler:

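import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlUnitDriverExt extends HtmlUnitDriver {
  public HtmlUnitDriverExt(BrowserVersion version) {
    super(version);
  }
  // Called by HtmlUnitDriver when it creates its WebClient; install the do-nothing handler here.
  @Override
  protected WebClient modifyWebClient(WebClient client) {
    client.setRefreshHandler(new RefreshHandler());
    return client;
  }
}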

The method modifyWebClient is a do-nothing hook that HtmlUnitDriver provides exactly for this purpose.
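Ignoring every refresh also drops refreshes that a page may legitimately rely on. A possible middle ground, sketched here only as an assumption and not as part of the solution above, is to allow a small number of refreshes per URL and ignore the rest. The class `RefreshBudget` is a hypothetical helper written with the JDK alone; it is not an HtmlUnit class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks how often each URL has asked to be refreshed.
final class RefreshBudget {
    private final int maxRefreshesPerUrl;
    private final Map<String, Integer> counts = new HashMap<>();

    RefreshBudget(int maxRefreshesPerUrl) {
        if (maxRefreshesPerUrl < 0) {
            throw new IllegalArgumentException("maxRefreshesPerUrl must not be negative");
        }
        this.maxRefreshesPerUrl = maxRefreshesPerUrl;
    }

    // Records one refresh attempt for the URL and returns true while it is within budget.
    synchronized boolean allow(String url) {
        int used = counts.merge(url, 1, Integer::sum);
        return used <= maxRefreshesPerUrl;
    }
}
```

A custom handleRefresh could consult such a budget, for example with `budget.allow(url.toExternalForm())`, and reload the page only when the call returns true, returning without action otherwise.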

Cheers.
