How do I scrape / automatically download PDF files from a document search web interface in R?


Question

I am using the R programming language for NLP (natural language processing) analysis - for this, I need to "webscrape" publicly available information on the internet.

Recently, I learned how to "webscrape" a single PDF file from the website I am using:

library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)

#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words <- article_words %>%
  anti_join(stop_words, by = "word")

#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)

#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w  ,  https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/

The above code works fine if you want to manually access a single website and then "webscrape" this website. Now, I want to try and automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 PDFs from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto

I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199

library(tidyverse)
library(rvest)
library(stringr)

page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto ")

raw_list <- page %>% 
    html_nodes("a") %>%  
    html_attr("href") %>% 
    str_subset("\\.pdf") %>% 
    str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) 
    map(read_html) %>% 
    map(html_node, "#raw-url") %>% 
    map(html_attr, "href") %>% 
    str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>% 
    walk2(., basename(.), download.file, mode = "wb") 

But this produces the following error:

Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'

Can someone please show me what I am doing wrong? Is it possible to download the first 10 pdf files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?

Thanks

Answer

I see some people suggesting that you use RSelenium, which is a way to simulate browser actions so that the web server renders the page as if a human were visiting the site. From my experience it is almost never necessary to go down that route. The JavaScript part of the website is interacting with an API, and we can use that API to circumvent the JavaScript part and get the raw JSON data directly.

In Firefox (and Chrome is similar in that regard, I assume) you can right-click on the website and select "Inspect Element (Q)", go to the "Network" tab and click on reload. You'll see every request the browser makes to the web server listed within a few seconds. We are interested in the ones that have the "Type" json. When you right-click on an entry you can select "Open in New Tab". One of the requests that returns JSON has the following URL attached to it: https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1. Opening that URL in Firefox gets you to a GUI that lets you explore the JSON data structure, and you'll see that there is a "results" entry which contains the data for the first 25 results of your search. Each one has a "path" entry that leads to the page displaying the embedded PDF. It turns out that if you replace the ".html" part of that path with ".pdf", it leads directly to the PDF file. The code below makes use of all this information.

library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)

base_url <- "https://www.canlii.org"

json_url_search_p1 <-
  "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"

This downloads the json for page 1 / results 1 to 25

results_p1 <-
  GET(json_url_search_p1, encode = "json") %>%
  content()

For each result we extract the path only.

result_html_paths_p1 <-
  map_chr(results_p1$results,
          ~ .$path)

We replace ".html" with ".pdf" and combine the base URL with each path to generate the full URLs pointing to the PDFs. Lastly, we pipe the result into purrr::map() and pdftools::pdf_text() in order to extract the text from all 25 PDFs.

pdf_texts_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .) %>%
  map(pdf_text)
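
The code above only reads the PDF text into memory. If, as asked in the question, you also want to save the first 10 files to disk as "pdf1", "pdf2", and so on, a minimal sketch along these lines should work (the file names and the call to download.file() are my addition, not part of the original answer):

# Sketch (my addition): save the first 10 PDFs to disk as "pdf1.pdf", "pdf2.pdf", ...
pdf_urls_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .)

walk2(
  head(pdf_urls_p1, 10),                # first 10 PDF URLs
  paste0("pdf", 1:10, ".pdf"),          # destination file names
  ~ download.file(.x, .y, mode = "wb")  # "wb" keeps the binary PDF intact
)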

If you want to do this for more than just the first page you might want to wrap the above code in a function that lets you switch out the "&page=" parameter. You could also make the "&text=" parameter an argument of the function in order to automatically scrape results for other searches.
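
As a rough sketch of what such a wrapper could look like (the function name get_search_results and the use of URLencode() are my own illustrative choices, not from the original answer):

# Hypothetical helper: fetch one page of search results for a given query.
get_search_results <- function(text, page = 1) {
  url <- paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision",
    "&text=", URLencode(text, reserved = TRUE),  # e.g. "dogs toronto" -> "dogs%20toronto"
    "&page=", page
  )
  GET(url) %>% content()
}

# e.g. the second page of the same search
results_p2 <- get_search_results("dogs toronto", page = 2)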

For the remaining part of the task we can build on the code you already have. We make it a function that can be applied to any article and apply that function to each PDF text again using purrr::map().

extract_article_summary <-
  function(article) {
    article_sentences <- tibble(text = article) %>%
      unnest_tokens(sentence, text, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      select(sentence_id, sentence)
    
    
    article_words <- article_sentences %>%
      unnest_tokens(word, sentence)
    
    
    article_words <- article_words %>%
      anti_join(stop_words, by = "word")
    
    textrank_sentences(data = article_sentences, terminology = article_words)
  }

This now will take a real long time!

article_summaries_p1 <- 
  map(pdf_texts_p1, extract_article_summary)

Alternatively you could use furrr::future_map() instead to utilize all the CPU cores in your machine and speed up the process.

library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <- 
  future_map(pdf_texts_p1, extract_article_summary)

Disclaimer

The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading big amounts of data. canlii offers API access on an individual request basis, see documentation here. This would be the correct and safest way to access their data.
