RCurl getURL with loop - link to a PDF kills looping
I've been puzzling over this long enough now and can't seem to figure out how to get around it. It's easiest to give working dummy code:
require(RCurl)
require(XML)
#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent <- "Firefox/23.0"
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)
list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
'http://timesofindia.indiatimes.com//articleshow/2933277.cms')
#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
'http://timesofindia.indiatimes.com//articleshow/2933277.cms')
for (i in seq_along(list1)) {
  print(list1[i])
  html <- try(getURL(
    list1[i],
    maxredirs = 20L,
    followlocation = TRUE,
    curl = curl
  ), silent = TRUE)
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}
gc()
for (i in seq_along(list2)) {
  print(list2[i])
  html <- try(getURL(
    list2[i],
    maxredirs = 20L,
    followlocation = TRUE,
    curl = curl
  ), silent = TRUE)
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}
This should run with the RCurl and XML libraries installed. The point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms
into the second position in the list, it kills not just that request but every subsequent getURL call in the loop (the other links are unchanged). This happens consistently, here and in other circumstances, whenever the link serves a PDF (open it to see).
Any thoughts on how to fix this so a link that serves a PDF doesn't kill my loop? As you can see, I have tried clearing out the potentially offending object, calling gc()
all over the place, etc., but I can't figure out why a PDF kills my loop.
Thanks!
Just to check, here is my output for the two for loops:
#[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
#[1] "success"
and
#[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"
You might find it easier to use httr. It wraps RCurl and sets the options you need by default. Here's the equivalent code with httr:
require(httr)
urls <- c(
'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
)
responses <- lapply(urls, GET)
sapply(responses, http_status)
sapply(responses, function(x) headers(x)$`content-type`)