rvest 与 RSelenium 结果用于文本提取 [英] rvest vs RSelenium results for text extracting

查看:43
本文介绍了rvest 与 RSelenium 结果用于文本提取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

到目前为止,我使用 RSelenium 来提取主页的文本,但我想切换到像 rvest 这样的快速解决方案.

So far i am using RSelenium to extract the text of a Homepage, but i would like to Switch to a fast solution like rvest.

library(rvest)
url = 'https://www.r-bloggers.com'
rvestResults <- read_html(url) %>%
  html_node('body') %>%
  html_text()

library(RSelenium)
remDr$navigate(url)
rSelResults <- remDr$findElement(
  using = "xpath",
  value = "//body"
)$getElementText()

比较下面的结果表明 rvest 包含一些 JavaScript 代码,而RSelenium 更干净".

Comparing the results below Shows that rvest includes some JavaScript Code, while the RSelenium is much "cleaner".

我知道 rvest 和 rselenium 之间的区别,rselenium 使用无头浏览器,而 rvest 只读取普通主页".

I am aware of the differences between rvest and rselenium, that rselenium uses a headless browser and rvest just reads the "plain Homepage".

我的问题是:有没有一种方法可以使用 rvest 或与 rvest 相同(或更快)的第三种方法获得下面的 Rselenium 输出?

Rvest 结果:

> substring(rvestResults, 1, 500)
[1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!
     \t\t\t\r\nfunction init() {\r\nvar vidDefer = document.getElementsByTagName('iframe');\r\nfor (var i=0; i<vidDefer.length; i++) {\r\nif(vidDefer[i].getAttribute('data-src')) 
     {\r\nvidDefer[i].setAttribute('src',vidDefer[i].getAttribute('data-src'));\r\n} } }\r\nwindow.onload = i"

RSelenium 结果:

RSelenium results:

> substring(rSelResults, 1, 500)
[1] "R news and tutorials contributed by (750) R bloggers\nHome\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\n�\n�\n�\nContact us\nWELCOME!\nHere you will find daily news and tutorials about R, 
     contributed by over 750 bloggers.\nThere are many ways to follow us -\nBy e-mail:\nOn Facebook:\nIf you are an R blogger yourself you are invited to add your own R content feed to this site (Non-English 
     R bloggers should add themselves- here)\nJOBS FOR R-USERS\nData/GIS Analyst for Ecoscape Environmental Consultants @ Kelowna, "

推荐答案

也许 webdriver,这是一个 PhantomJS 实现,会做得更好(目前无法针对 RSelenium 进行测试):

Maybe webdriver, which is a PhantomJS implementation, would do a better job (can't test against RSelenium at the moment):

library("webdriver")
library("rvest")

pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
url <- 'https://www.r-bloggers.com'
ses$go(url)

res <- ses$getSource() %>% 
  read_html() %>%
  html_node('body') %>%
  html_text()

substring(res, 1, 500)
#> [1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!\t\t\t\n\n\n\n\nHere you will find daily news and tutorials about R, contributed by over 750 bloggers. \n\nThere are many ways to follow us - \nBy e-mail:\n\n\n<img src=\"https://feeds.feedburner.com/~fc/RBloggers?bg=99CCFF&amp;fg=444444&amp;anim=0\" height=\"26\" width=\"88\" sty"

这篇关于rvest 与 RSelenium 结果用于文本提取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆