Downloading all PDFs from URL


Question

I have a website that has several hundred PDFs. I need to iterate through them and download every PDF to my local machine. I would like to use rvest. My attempt:

library(rvest)

url <- "https://example.com"

scrape <- url %>% 
  read_html() %>% 
  html_node(".ms-vb2 a") %>%
  download.file(., 'my-local-directory')

How do I grab each PDF from the link? The download.file() does not work, and I have no clue how to get each file. I just get this error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseEntityRef: no name [68]

Answer

library(rvest)
library(httr)   # for config() and content()

url <- "https://example.com"
page <- html_session(url, config(ssl_verifypeer = FALSE))

# link targets, the subject column, and the link text shown on the page
links   <- page %>% html_nodes(".ms-vb2 a") %>% html_attr("href")
subject <- page %>% html_nodes(".ms-vb2:nth-child(3)") %>% html_text()
name    <- page %>% html_nodes(".ms-vb2 a") %>% html_text()

for (i in seq_along(links)) {
  # follow each link and write the raw response body out as a PDF
  pdf_page <- html_session(URLencode(paste0("https://example.com", links[i])),
                           config(ssl_verifypeer = FALSE))
  writeBin(content(pdf_page$response, as = "raw"),
           paste0(name[i], "-", subject[i], ".pdf"))
}

The URL is http, so I had to use config(ssl_verifypeer = FALSE).

writeBin names the file according to your needs; I have just named them ok_1.pdf, ok_2.pdf, and so on.
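
For reference, a minimal sketch of the same loop using that sequential ok_1.pdf, ok_2.pdf naming instead of the name/subject columns. It assumes links was built as in the code above and reuses the same https://example.com base URL. (In rvest 1.0 and later, html_session() is called session().)

# Sketch: same download loop, but with sequential file names (ok_1.pdf, ok_2.pdf, ...).
# Assumes `links` holds the hrefs collected above and rvest/httr are loaded.
for (i in seq_along(links)) {
  pdf_page <- html_session(URLencode(paste0("https://example.com", links[i])),
                           config(ssl_verifypeer = FALSE))
  writeBin(content(pdf_page$response, as = "raw"),
           paste0("ok_", i, ".pdf"))
}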
