R - Waiting for page to load in RSelenium with PhantomJS


Problem description

I put together a crude scraper that scrapes prices/airlines from Expedia:

# Start the Server
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)

# Assign the client
remDr <- rD$client

# Establish a wait for an element
remDr$setImplicitWaitTimeout(1000)

# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)

# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10)   # Been testing with 10

###ADD JAVASCRIPT INJECTION HERE###
# remDr$executeScript(?)   (placeholder for the wait logic asked about below)

# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)

# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)

# close client/server
remDr$close()
rD$server$stop()

As you can see, I built in an ImplicitWaitTimeout and a Sys.sleep call so that the page has time to load in PhantomJS and so that I don't overload the website with requests.

Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has not yet loaded completely and the class='dollars price-emphasis' elements don't exist yet. The URL construction is fine.

Whenever the page successfully loads all the way, the scraper gets close to 60 prices and flights. I mention this because there are times when the script returns only 15-20 entries (when checking the same date manually in a browser, there are 60). In those cases I'm only finding 20 of the 60 elements, meaning the page has only partially loaded.
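
One way to detect this kind of partial load is to keep re-querying the price elements until the count stops growing. A minimal sketch of that idea, where the 2-second poll interval and 15-attempt cap are assumed values rather than part of the original script:

last_count <- 0
for (i in 1:15) {
  webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
  # stop once at least one element is found and the count is no longer increasing
  if (length(webElem) > 0 && length(webElem) == last_count) break
  last_count <- length(webElem)
  Sys.sleep(2)
}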

I want to make this script more robust by injecting JavaScript that waits for the page to fully load before looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to limited knowledge of JS, I'm having problems adapting these solutions to work syntactically with my script.
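
For context, remDr$executeScript() takes a JavaScript string and returns the result to R. A minimal sketch, assuming the same price selector as above, that asks the browser how many price nodes currently exist:

# how many price elements has the page rendered so far?
n_prices <- unlist(remDr$executeScript(
  "return document.querySelectorAll('.dollars.price-emphasis').length;"
))
# n_prices could then be compared against the ~60 results expected for a full load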

Here are several solutions that have been proposed in Wait for page load in Selenium and Selenium - How to wait until page is completely loaded:

Base code:

remDr$executeScript(
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");, args = list()
)

Additions to the base script:

1) Check for Staleness of an Element

// get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
// wait for the "Add Item" element to become stale
wait.until(ExpectedConditions.stalenessOf(element));

2) Wait for Visibility of element

wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
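
RSelenium has no direct equivalent of ExpectedConditions, but a rough approximation of the visibility check (a sketch only, assuming the element is already present in the DOM) is to find it and poll its isElementDisplayed() method:

webElem <- remDr$findElement(using = "css", "[class='dollars price-emphasis']")
while (!unlist(webElem$isElementDisplayed())) {
  Sys.sleep(0.5)   # assumed poll interval
}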

I have tried to use remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even when it isn't.
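
For reference, that readyState check written in R syntax (rather than with the Java-style .equals() call) looks roughly like the sketch below; as noted, it is of limited use here because the document can report complete before the flight results finish rendering:

ready <- unlist(remDr$executeScript("return document.readyState;"))
if (identical(ready, "complete")) {
  # the document reports complete, but dynamically injected results may still be loading
}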

Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to load fully, with close to 60 found elements? I'm still learning, so any help would be greatly appreciated.

Recommended answer

Solution using while/tryCatch:

remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  # loop until an element with name <value> is found on <webpage url>
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e) {NULL})
}
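
Applied to the Expedia page above, one way to adapt this (a sketch; the 30-second cap and 1-second poll interval are assumed values, not part of the original answer) is to wait for the price elements specifically and give up after a timeout:

remDr$navigate(appURL)
webElem <- NULL
waited <- 0
while (is.null(webElem) && waited < 30) {
  # keep trying the price selector until it appears or ~30 seconds have passed
  webElem <- tryCatch({remDr$findElement(using = "css", "[class='dollars price-emphasis']")},
                      error = function(e) {NULL})
  Sys.sleep(1)
  waited <- waited + 1
}
# webElem is still NULL here only if the prices never appeared within the timeout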
