Web scraping PDF files from HTML


Problem description


How can I scrape the PDF documents from an HTML page? I am using R and can only extract the text from the HTML. An example of the website I am going to scrape is as follows.

https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx

Thanks in advance.

Recommended answer


When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page   <- getURL(url)                  # fetch the raw HTML
parsed <- htmlParse(page)              # parse it into a DOM tree
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")  # all href attributes
inds   <- grep("\\.pdf$", links)       # keep only links ending in .pdf
links  <- links[inds]


links now contains all the URLs of the PDF files you are trying to download.
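One caveat, depending on how the page is written (this is an assumption on my part, not something visible in your question): the hrefs extracted this way may be relative paths rather than full URLs, in which case download.file() would fail. A minimal sketch of normalizing them, using made-up example links:

```r
# Hypothetical relative and absolute links, as they might come out of the page
base  <- "https://www.bot.or.th"
links <- c("/English/Reports/a.pdf", "https://www.bot.or.th/b.pdf")

# Prefix the site root onto anything that does not already start with http(s)://
abs <- ifelse(grepl("^https?://", links), links, paste0(base, links))
```

After this, abs holds full URLs for every link, and the download loop below works unchanged.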


Beware: many websites do not like it very much when you scrape their documents automatically, and you may get blocked.


With the links in place, you can loop through them, downloading each one and saving it in your working directory under the name in destination. I decided to derive reasonable document names for your PDFs from the links (extracting the final piece after the last / in the URLs):

regex_match <- regexpr("[^/]+$", links, perl=TRUE)  # match everything after the last "/"
destination <- regmatches(links, regex_match)       # the bare file names
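To make the extraction concrete, here is the same pattern applied to a single hypothetical URL (the file name is made up for illustration):

```r
# "[^/]+$" matches the longest run of non-slash characters at the end of the
# string, i.e. everything after the last "/"
link  <- "https://example.com/docs/report_2016.pdf"  # hypothetical URL
m     <- regexpr("[^/]+$", link, perl = TRUE)
fname <- regmatches(link, m)
```

Here fname is "report_2016.pdf", which is what gets passed to destfile below.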


To avoid overloading the website's servers, I have heard it is friendly to pause your scraping every once in a while, so I use `Sys.sleep()` to pause between downloads for a random time between 1 and 5 seconds:

for(i in seq_along(links)){
  # mode = "wb" writes the file in binary mode, which PDFs need (notably on Windows)
  download.file(links[i], destfile=destination[i], mode="wb")
  Sys.sleep(runif(1, 1, 5))  # random pause of 1 to 5 seconds
}

