How to optimise scraping with getURL() in R

Views: 840 | Posted: 2017/3/6 1:04:49 | Tags: r, curl, web-scraping

Problem description

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and list fewer than 1,000 bills each.

To do this, I call getURL in the following loop:

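library(RCurl)    # getURL()
library(stringr)  # str_extract_all()

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12", "13") # legislature id

lapply(l, FUN = function(x) {
  # index page listing every bill of legislature x
  print(index <- paste(b, x, "documents/index-dossier.asp", sep = "/"))

  # scrape the index page and split it into lines
  data <- getURL(index); data <- readLines(tc <- textConnection(data)); close(tc)
  # extract the relative path of each bill page (dot escaped so it matches literally)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+\\.asp"))
  # build the absolute URLs and download every bill page
  data <- paste(b, x, data, sep = "/")
  data <- getURL(data)
  write.table(data, file = n <- paste("raw_an", x, ".txt", sep = "")); str(n)
})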

Is there any way to optimise the getURL() function here? I cannot seem to use concurrent downloading by passing the async=TRUE option, which gives me the same error every time:

Error in function (type, msg, asError = TRUE)  : 
Failed to connect to 0.0.0.12: No route to host

Any ideas? Thanks!
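
For reference, the concurrent-download idea raised in the question could be sketched roughly as follows. This is only a sketch under stated assumptions, not a confirmed fix for the error quoted above. The helper names build_dossier_urls and fetch_in_batches and the batch size of 50 are hypothetical and chosen for illustration. The first helper uses base R only, so the URL-building step can be checked offline. The second assumes the RCurl package is installed and relies on its getURIAsynchronous() function.

```r
# Hypothetical helper: extract bill paths from the index page text and
# turn them into absolute URLs. Base R only, no network access.
build_dossier_urls <- function(html, base, legislature) {
  pattern <- "dossiers/[[:alnum:]_-]+\\.asp"
  paths <- unique(unlist(regmatches(html, gregexpr(pattern, html))))
  # paste() would still return "base/legislature/" for an empty vector,
  # so return early when nothing matched
  if (length(paths) == 0) return(character(0))
  paste(base, legislature, paths, sep = "/")
}

# Hypothetical helper: download URLs in batches so that only a limited
# number of requests is in flight at once. Assumes RCurl is installed.
fetch_in_batches <- function(urls, batch_size = 50) {
  stopifnot(requireNamespace("RCurl", quietly = TRUE))
  if (length(urls) == 0) return(character(0))
  batches <- split(urls, ceiling(seq_along(urls) / batch_size))
  pages <- lapply(batches, function(u) RCurl::getURIAsynchronous(u))
  unlist(pages, use.names = FALSE)
}

# Offline usage example with made-up index page lines
sample_html <- c(
  '<a href="dossiers/budget_2003.asp">x</a>',
  'no link on this line',
  '<a href="dossiers/loi-test.asp">y</a> <a href="dossiers/budget_2003.asp">z</a>'
)
urls <- build_dossier_urls(sample_html, "http://www.assemblee-nationale.fr", "12")
print(urls)
```

Printing the URLs before downloading makes it easy to verify that every entry is a complete address. Calling fetch_in_batches(urls) would then perform the actual requests.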
