Using parallelisation to scrape web pages with R

Question

I am trying to scrape a large amount of web pages to later analyse them. Since the number of URLs is huge, I had decided to use the parallel package along with XML.

Specifically, I am using the htmlParse() function from XML, which works fine when used with sapply, but generates empty objects of class HTMLInternalDocument when used with parSapply.

library(parallel)  # makeCluster(), parSapply(), detectCores(), stopCluster()
library(XML)       # htmlParse()

url1 <- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2 <- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3 <- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"

myFunction <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  return(ok)
}

urls <- c(url1, url2, url3)

# Works: sapply() returns properly parsed documents
output1 <- sapply(urls, function(x) htmlParse(x))
str(output1[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]


# Doesn't work: the same call through parSapply() comes back empty
myFunction <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  stopCluster(cl)
  return(ok)
}

output2 <- myFunction(urls)
str(output2[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
# empty

Thanks.

Answer

You can use getURIAsynchronous from the RCurl package, which allows the caller to specify multiple URIs to download at the same time.

library(RCurl)
library(XML)

get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  ## this part can easily be parallelised
  ## (I am just using lapply here as a first attempt; see the parallel sketch below)
  res <- lapply(txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
  res
}

get.synch <- function(urls){
  lapply(urls, function(x){
    doc <- htmlParse(x)
    res2 <- xpathSApply(doc, "/html/body/h2[2]", xmlValue)
    res2
  })
}
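
As the comment inside get.asynch notes, the parsing step can itself be parallelised: it operates on plain character vectors (the downloaded HTML text), which serialise cleanly to worker processes. Here is a minimal sketch of that idea using parLapply from the parallel package (the name get.asynch.par and the cluster size are illustrative choices):

library(RCurl)      # getURIAsynchronous()
library(XML)
library(parallel)   # makeCluster(), parLapply(), clusterEvalQ(), stopCluster()

get.asynch.par <- function(urls){
  txt <- getURIAsynchronous(urls)       # download all pages concurrently
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl))              # shut the cluster down even on error
  clusterEvalQ(cl, library(XML))        # workers need XML for htmlParse()
  parLapply(cl, txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
}

Only the extracted text is sent back from the workers, so nothing depends on transferring parsed documents between processes.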

Here is some benchmarking for 100 URLs; the asynchronous version roughly halves the total time.

library(microbenchmark)
uris <- c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100, uris)
microbenchmark(get.asynch(urls), get.synch(urls), times = 1)

Unit: seconds
             expr      min       lq   median       uq      max neval
 get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
  get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
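
On the behaviour reported in the question: htmlParse() returns an HTMLInternalDocument, which is just an external pointer to a C-level libxml2 structure. External pointers cannot be serialised back from the worker processes that parSapply() uses, so the documents arrive on the master as empty shells. If you prefer to keep the parSapply() approach, extract ordinary R values on the workers and return those instead of the parsed documents. A minimal sketch (the function name parse.on.workers is illustrative, and it reuses the XPath expression from above):

library(parallel)
library(XML)

parse.on.workers <- function(urls){
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl))
  clusterEvalQ(cl, library(XML))        # load XML on each worker
  parSapply(cl, urls, function(u){
    doc <- htmlParse(u)                 # parse on the worker
    # return a character vector; unlike an external pointer,
    # it survives serialisation back to the master
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
}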
