Extract text from search result URLs using R


Problem Description

I know R a bit, but I'm not a pro. I am working on a text-mining project in R.

I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results has the URL: https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation

This page has 10 search results (10 URLs). I want to write R code that will 'read' the page corresponding to each of those 10 URLs and extract the text from those web pages into .txt files. My only input is the URL mentioned above.

I appreciate your help. If there is a similar older post, please point me to it as well. Thank you.

Recommended Answer

This is a basic idea of how to go about scraping these pages, though it might be slow in R if there are many pages to scrape. Your question is a bit ambiguous: you want the end result to be .txt files, but what about the web pages that contain PDFs? You can still use this code and simply change the file extension to .pdf for the pages that have PDFs (one possible way to do that is sketched right after the code block below).

 library(xml2)
 library(rvest)

 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

 # Pull the result links, drop duplicates, read each page's <body>,
 # and write each one to a temp file with a .txt extension
 urll %>%
   read_html() %>%
   html_nodes("div#results a") %>%
   html_attr("href") %>%
   .[!duplicated(.)] %>%
   lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
   Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
       ., c(paste("tmp", 1:length(.))))
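
As noted above, some of the results can be PDF documents rather than HTML pages. A minimal sketch for saving those as .pdf files instead, assuming the PDF links end in ".pdf" (that pattern is an assumption, not something confirmed by the page):

 library(xml2)
 library(rvest)

 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

 # Collect the result links, then keep only the ones that look like PDFs
 links <- urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href")
 pdf_links <- unique(links[grepl("\\.pdf$", links, ignore.case = TRUE)])

 # Download each PDF as a binary file into tempdir()
 pdf_files <- tempfile(paste0("result", seq_along(pdf_links)), fileext = ".pdf")
 Map(function(u, f) download.file(u, f, mode = "wb"), pdf_links, pdf_files)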

This is the breakdown of the code above. The URL you want to scrape from:

 urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs you need:

  allurls <- urll %>% read_html() %>% html_nodes("div#results a") %>%
    html_attr("href") %>% .[!duplicated(.)]
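
Optionally, you can take a quick look at what was collected; the exact number of links depends on what the page returns when you scrape it:

  length(allurls)   # number of unique result links found
  head(allurls)     # preview the first few URLs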

Where do you want to save your texts? Create the temp files:

 tmps <- tempfile(c(paste("tmp", 1:length(allurls))), fileext = ".txt")

As of now, allurls is of class character. You have to read each one into an XML document in order to scrape it. Then, finally, write the results into the temp files created above:

  allurls %>%
    lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
    Map(function(x, y) write_html(x, y, options = "format"), ., tmps)

Please do not leave anything out. For example, after ..."format"), there is a period (the . pipe placeholder); take that into consideration. Your files have now been written to the temp directory. To find out where they are, just type the command tempdir() at the console and it will print the location of your files. You can also change where the files are saved while scraping by adjusting the tempfile command.
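
Since the goal stated in the question is plain .txt files, a hedged variation is to pull the text out of each page with html_text() and write it with writeLines(), saving into a directory of your choice instead of tempdir(). The folder name here is just an example:

  out_dir <- "fed_results"                  # example output folder, change as needed
  dir.create(out_dir, showWarnings = FALSE)
  out_files <- file.path(out_dir, paste0("tmp", seq_along(allurls), ".txt"))

  # Read each page, keep the <body>, extract its plain text, and write it to a .txt file
  allurls %>%
    lapply(function(x) read_html(x) %>% html_nodes("body") %>% html_text()) %>%
    Map(function(txt, f) writeLines(txt, f), ., out_files)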

Hope this helps.

