How to download embedded PDF files from webpage using RSelenium?

Question

EDIT: Based on the comments I have received so far, I managed to reach the PDF file I am looking for with RSelenium, using the following code:

library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()

Now I need R to click the download button, but I cannot get that to work. I tried:

button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()

But I get the following error:

Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'

Erro:    Summary: NoSuchElement
 Detail: An element could not be located on the page using the given search parameters.
 class: org.openqa.selenium.NoSuchElementException
 Further Details: run errorDetails method

Can anyone tell what is going wrong here? Thanks!

Original question:

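I have several webpages from which I need to download embedded PDF files, and I am looking for a way to automate this with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398 It is a page from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent of the US Securities and Exchange Commission, SEC) for downloading the Notes to Financial Statements (Notas Explicativas) of Brazilian companies.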

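I tried several options, but the way the website is built seems to make it hard to extract a direct link. I tried what is suggested here, Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector. Similarly, when I tried the approach at https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") also produced an empty vector.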

I am not used to web scraping code in R, so I do not know what is happening. I appreciate any help!

Recommended answer

In case anyone faces the same problem I did, I am posting the solution I used:

library(RSelenium)

# Set up a Firefox profile that downloads PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))

driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)

option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()

# Find the iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the top-level page
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their ids
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes inside the current frame
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their ids
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)

# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Needs some time to finish the download before closing the window
remote_driver$close() # Close the window
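
The fixed Sys.sleep(3) calls above work only if the page and the download finish within three seconds. A possible alternative is to poll until a condition holds. The sketch below is not part of the original solution: wait_until and pdf_downloaded are hypothetical helper names, and it assumes Firefox keeps a temporary ".part" file next to the PDF while the download is in progress.

```r
# Poll `condition` until it returns TRUE or `timeout` seconds elapse.
# An error raised by `condition` (e.g. NoSuchElement) counts as "not ready yet".
wait_until <- function(condition, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Hypothetical helper: TRUE once `dir` holds at least one PDF and no
# in-progress ".part" file (assumed Firefox behaviour) is left there.
pdf_downloaded <- function(dir) {
  length(list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE)) > 0 &&
    length(list.files(dir, pattern = "\\.part$")) == 0
}

# Possible usage with the session above (not run here; the download
# directory is an assumption and depends on the Firefox profile):
# wait_until(function() length(remote_driver$findElements("id", "download")) > 0)
# wait_until(function() pdf_downloaded("~/Downloads"), timeout = 60)
```

wait_until returns FALSE on timeout instead of raising an error, so the caller decides whether to retry or stop. pdf_downloaded only checks that some PDF exists in the directory, so it is most reliable when the download folder starts out empty.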
