What is the difference between rvest::html_text and RSelenium::getPageSource?


Question

I'm scraping a number of webpages and noticed that rvest (read_html, then html_text) returns different results from RSelenium (getPageSource()).

More specifically, when dropdown menus are involved, html_text only gives you the names of the choices, while RSelenium also gives you the URL of the page you will be directed to once you choose one.

My questions are: (1) why is there a difference, and what exactly is its nature? and (2) is there a way to get the same source text as RSelenium returns, but with a faster approach such as the rvest package?

I have tried webdriver, a PhantomJS implementation, as suggested in "rvest vs RSelenium results for text extracting", and its getSource function does return the same result as RSelenium. However, while it is faster than RSelenium, it is still much slower than rvest.

library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)

test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)

# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()

# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()

remDr$navigate(test_url)
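# getPageSource() reads the page already loaded by navigate(); it returns a list
# whose first element is the HTML string
resultB <- remDr$getPageSource()[[1]]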
tictoc::toc()

# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)

ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()

You can see that resultA differs from resultB and resultC. Specifically, I am interested in the part from the word "Tools" onwards, which is the dropdown menu for choosing among the different "Tools" tabs that this website provides.

Showing just a small chunk, the "BEARFACTS" entry in the rvest result is:

BEARFACTS\n                                    \n                                                \n                                    

while in RSelenium it is something like the following:

<li class=\"expanded dropdown\">\n                    <a href=\"https://apps.bea.gov/regional/bearfacts/\">BEARFACTS</a>\n  

Recommended answer

The difference between RSelenium and rvest is:

  • RSelenium runs a real web browser, so it executes any javascript contained in the webpage (javascript is often used to load additional html elements or data after the initial html has loaded).
  • rvest does not run javascript, so it retrieves the page html faster but misses any elements that javascript loads after the initial page load.
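
Part of the difference seen in the question may also come from what each call returns rather than from javascript: html_text() returns only the text content of the nodes, while getPageSource() returns the markup, attributes included. The following is a minimal sketch, assuming the menu link is present in the html that read_html() receives (not verified here for bea.gov) and rvest 1.0.0 or later (older versions call the selector function html_nodes()). The inline snippet is adapted from the RSelenium output shown in the question.

```r
library(rvest)

# Inline HTML adapted from the RSelenium output in the question,
# so this example runs without network access.
snippet <- '<ul><li class="expanded dropdown">
  <a href="https://apps.bea.gov/regional/bearfacts/">BEARFACTS</a>
</li></ul>'

page  <- read_html(snippet)
links <- html_elements(page, "li.dropdown a")

# html_text() drops the tags and attributes and keeps only the label.
link_text <- html_text(links, trim = TRUE)

# html_attr() reads an attribute from the same nodes, here the target URL.
link_href <- html_attr(links, "href")

# as.character() serializes the nodes back to markup.
link_markup <- as.character(links)
```

If the link is in the initial html, html_attr() gives the URL without a browser; if it only appears after javascript runs, a browser-based tool is still needed.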

Some useful tips:

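  • When scraping pages that do not load javascript, use rvest.
  • When you have to use RSelenium, try the headless option to improve speed (it loads the page in the browser as usual but does not display any graphical elements, so it is faster).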
eCaps <- list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))

rD <- rsDriver(browser=c("chrome"), verbose = TRUE, chromever="78.0.3904.105", port=4447L, extraCapabilities = eCaps) 
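
Once a headless session is running, the rendered source can be handed to rvest for parsing, so the browser is only used for the rendering step. This is a rough sketch, assuming remDr is a client such as rD$client from the call above; rendered_hrefs is a hypothetical helper name, not part of either package.

```r
library(rvest)

# Hypothetical helper: load `url` in the browser session `remDr`,
# then parse the rendered HTML with rvest and return every link target.
rendered_hrefs <- function(remDr, url) {
  remDr$navigate(url)
  # getPageSource() returns a list; its first element is the HTML string.
  html <- remDr$getPageSource()[[1]]
  page <- read_html(html)
  html_attr(html_elements(page, "a"), "href")
}

# Usage (requires a running Selenium session, so left commented out):
# remDr <- rD$client
# hrefs <- rendered_hrefs(remDr, "https://www.bea.gov")
```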
