How to webscrape texts that are contained in sublinks of a link in R?

Posted: 2022/9/2 18:00:04 · Tags: r, web-scraping, rvest, web-scraping-language


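Question

I am trying to scrape this website.

As you can see, there is a main link and a series of titles that you can click to access the text. What I ultimately want is the text behind all of these sublinks of the main link. I am not very familiar with web scraping, so I looked around and thought it would be something like this: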


library(rvest)

x <- read_html("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

x1 <- html_nodes(x, ".doc-title a") # this using selector gadget

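However, this attempt failed badly. Can anyone help me?

Thank you very much!

Recommended answer

It is possible to get the links of the initial page and then the text behind them: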

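library(RSelenium)
library(rvest)
library(stringr) # provides str_extract_all() and str_detect() used below
# Start a Selenium server in Docker.
# Note: shell() exists only in R on Windows; on Linux/macOS use system() instead.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')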
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# Scroll down step by step so that the whole page gets loaded
for(i in 1 : 100)
{
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}

Sys.sleep(5)
html_Content <- remDr$getPageSource()[[1]]
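# Collect every press-conference path found in the page source
html_Link <- str_extract_all(string = html_Content, pattern = "/press/pressconf/[^<]*html")[[1]]
# Keep only the English pages (backslashes must be doubled inside an R string)
html_Link_En <- html_Link[str_detect(html_Link, "\\.en\\.html")]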
links_To_Remove <- c("/press/pressconf/html/index.en.html", "/press/pressconf/visual-mps/html/index.en.html" )
html_Link_En <- html_Link_En[!(html_Link_En %in% links_To_Remove)]
html_Link_En <- unique(html_Link_En)

# Extract text from first link
# It is possible to use a for loop to get the text of all links ...
html_Content <- read_html(paste0("https://www.ecb.europa.eu", html_Link_En[1]))
html_Content %>% html_text()
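
The for loop mentioned in the comment above could look roughly like the sketch below. It is only an illustration under a few assumptions: `build_urls`, `fetch_text`, and `scrape_all` are hypothetical helper names that are not part of the original answer, the one-second pause between requests is an arbitrary choice, and `html_Link_En` is assumed to hold the relative paths collected earlier.

```r
# Hypothetical helper: turn relative ECB paths into absolute URLs.
build_urls <- function(paths, base = "https://www.ecb.europa.eu") {
  paste0(base, paths)
}

# Hypothetical helper: download one page and return its text,
# or NA if the request fails for any reason.
fetch_text <- function(url) {
  tryCatch(
    rvest::html_text(rvest::read_html(url)),
    error = function(e) NA_character_
  )
}

# Hypothetical wrapper: fetch every link, pausing `delay` seconds
# before each request, and return one row per page.
scrape_all <- function(paths, delay = 1, fetch = fetch_text) {
  urls <- build_urls(paths)
  texts <- vapply(urls, function(u) {
    Sys.sleep(delay)
    fetch(u)
  }, character(1))
  data.frame(url = urls, text = unname(texts), stringsAsFactors = FALSE)
}

# Usage (assumes html_Link_En from the code above):
# all_texts <- scrape_all(html_Link_En)
```

Returning a data frame keeps each text next to the URL it came from, and pages that fail to download show up as NA instead of stopping the whole loop.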
