How to optimise scraping with getURL() in R
Problem description
I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and list fewer than 1,000 bills each.
To do so, I use getURL() in this loop:
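library(RCurl)   # getURL()
library(stringr) # str_extract_all()

b <- "http://www.assemblee-nationale.fr" # base URL
l <- c("12", "13")                       # legislature ids
lapply(l, FUN = function(x) {
  # index page listing every bill of the legislature
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))
  # scrape the index and split it into lines
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  # extract the relative link of each bill page (dot escaped so it matches literally)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+\\.asp"))
  data <- paste(b, x, data, sep = "/")
  # download every bill page
  data <- getURL(data)
  write.table(data, file = n <- paste("raw_an", x, ".txt", sep = "")); str(n)
})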
Is there any way to optimise the getURL() function here? I cannot seem to use concurrent downloading by passing the async=TRUE option, which gives me the same error every time:
Error in function (type, msg, asError = TRUE) :
Failed to connect to 0.0.0.12: No route to host
Any ideas? Thanks!
Recommended answer
Try mclapply {multicore} instead of lapply.
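As a minimal sketch of this suggestion: the multicore package has since been folded into the parallel package that ships with R, so the example below loads parallel instead. The helper name fetch_all and its fetch and cores arguments are illustrative assumptions, not part of RCurl or parallel. It parallelises over the individual bill URLs rather than over the two legislatures, since the latter would only ever give two tasks.

```r
library(parallel) # ships with R; provides mclapply()

# Hypothetical helper: download each URL in a separate forked worker.
# `fetch` defaults to RCurl::getURL but can be any function(url) -> character.
fetch_all <- function(urls, fetch = RCurl::getURL, cores = 2L) {
  # forking is unavailable on Windows, where mc.cores must be 1
  if (.Platform$OS.type == "windows") cores <- 1L
  pages <- mclapply(urls, fetch, mc.cores = cores)
  names(pages) <- urls # keep track of which page came from which URL
  pages
}
```

In the loop from the question, the second download step could then read data <- unlist(fetch_all(data)) in place of data <- getURL(data). Whether this is actually faster depends on the server and the network, so it is worth timing both versions.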
"mclapply is a parallelized version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." (http://www.rforge.net/doc/packages/multicore/mclapply.html)
If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.
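"Note that xmlTreeParse does allow a hybrid style of processing that allows us to apply handlers to nodes in the tree as they are being converted to R objects. This is a style of event-driven or asynchronous calling." (http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)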
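The handler style mentioned in the quote can be sketched as follows. This is only an illustration under assumptions: it uses htmlTreeParse, the HTML counterpart of xmlTreeParse in the XML package, because real web pages are rarely well-formed XML; the link_collector closure and the sample page are made up for the example. It does not make the downloads themselves concurrent; it only shows how links can be gathered while the tree is being built instead of with a regular expression over raw lines.

```r
library(XML) # htmlTreeParse(), xmlGetAttr()

# Hypothetical closure: gathers the href of every <a> node as the
# parser converts it to an R object.
link_collector <- function() {
  links <- character()
  list(
    a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node # return the node unchanged so it stays in the tree
    },
    links = function() links
  )
}

# Illustrative snippet standing in for a downloaded index page
page <- '<html><body>
  <a href="/12/dossiers/example_one.asp">One</a>
  <a href="/12/dossiers/example_two.asp">Two</a>
</body></html>'

h <- link_collector()
invisible(htmlTreeParse(page, asText = TRUE, handlers = h))

# keep only the links that point to bill pages
bill_links <- grep("dossiers/[[:alnum:]_-]+\\.asp$", h$links(), value = TRUE)
```

With a real page, the string returned by getURL() would take the place of page.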