Downloading all PDFs from URL
Question
I have a website that contains several hundred PDFs. I need to iterate through them and download every PDF to my local machine. I would like to use rvest. My attempt:
library(rvest)
url <- "https://example.com"
scrape <- url %>%
  read_html() %>%
  html_node(".ms-vb2 a") %>%
  download.file(., 'my-local-directory')
How do I grab each PDF from the links? download.file() does not work, and I have no idea how to get each file. I just get this error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseEntityRef: no name [68]
Answer
library(rvest)
library(httr)  # for config()

url <- "https://example.com"
page <- html_session(url, config(ssl_verifypeer = FALSE))

# Scrape the link targets, the subject column, and the link text
links   <- page %>% html_nodes(".ms-vb2 a") %>% html_attr("href")
subject <- page %>% html_nodes(".ms-vb2:nth-child(3)") %>% html_text()
name    <- page %>% html_nodes(".ms-vb2 a") %>% html_text()

for (i in seq_along(links)) {
  # Follow each PDF link and write the raw response body to disk
  pdf_page <- html_session(URLencode(paste0("https://example.com", links[i])),
                           config(ssl_verifypeer = FALSE))
  writeBin(pdf_page$response$content,
           paste0(name[i], "-", subject[i], ".pdf"))
}
The URL is http, so I had to use config(ssl_verifypeer = FALSE).
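If you would rather stick with download.file(), as in the original attempt, a minimal sketch reusing the links, name, and subject vectors scraped above (and assuming the hrefs are relative to the site root) could be:

for (i in seq_along(links)) {
  # mode = "wb" keeps the PDF bytes intact (important on Windows)
  download.file(URLencode(paste0("https://example.com", links[i])),
                destfile = paste0(name[i], "-", subject[i], ".pdf"),
                mode = "wb")
}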
The second argument of writeBin names the file; name it according to your needs. I just named them ok_1.pdf, ok_2.pdf, and so on.
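For example, to get the ok_1.pdf, ok_2.pdf numbering instead of the name-subject pattern used above, the loop body could simply use the index (a sketch against the loop above):

writeBin(pdf_page$response$content, paste0("ok_", i, ".pdf"))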