RCurl getURL with loop - link to a PDF kills looping


Question


I've been puzzling over this for long enough now and can't seem to figure out how to get around it. It's easiest to give working dummy code:

require(RCurl)
require(XML)

#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Firefox/23.0" 
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt' ,
  useragent = agent,
  followlocation = TRUE ,
  autoreferer = TRUE ,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)


list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')



for ( i in seq( list1 ) ){
  print(list1[i])
  html <-
    try( getURL(
      list1[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (class (html) == "try-error") {
    print(paste("error accessing",list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}


gc()

for ( i in seq( list2 ) ){
  print(list2[i])
  html <-
    try( getURL(
      list2[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (class (html) == "try-error") {
    print(paste("error accessing",list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}

This should run with the RCurl and XML libraries installed. The point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms into the second position in the list, it kills every subsequent getURL call in the loop (the other links are identical to list1). This happens consistently, here and in other cases, whenever the link serves a PDF (open the URL to see).

Any thoughts on how to fix this so getting a link that contains a PDF doesn't kill my loop? As you can see, I have tried to clear out the potentially offending object, gc() all over the place, etc. but I can't figure out why a PDF kills my loop.
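One idea (a sketch only, based on the assumption that the PDF response leaves the shared curl handle in a bad state) would be to build a fresh handle inside the loop instead of reusing one across all requests:

# sketch: same second loop as above, but with a new curl handle per request,
# in case the PDF response corrupts the shared handle's state
for ( i in seq( list2 ) ){
  print(list2[i])
  h <- getCurlHandle(
    cookiejar = 'cookies.txt',
    useragent = agent,
    followlocation = TRUE,
    autoreferer = TRUE,
    httpauth = 1L
  )
  html <- try( getURL( list2[i], maxredirs = as.integer(20), curl = h ), TRUE)
  if (class(html) == "try-error") {
    print(paste("error accessing", list2[i]))
    next
  } else {
    print('success')
  }
}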

Thanks!

Just to check, here is my output for the two for loops:

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "success"

and

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"

Solution

You might find it easier to use httr. It wraps RCurl and sets the options you need by default. Here's the equivalent code with httr:

require(httr)

urls <- c(
  'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
)

responses <- lapply(urls, GET)
sapply(responses, http_status)

sapply(responses, function(x) headers(x)$`content-type`)
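If you also want the loop itself to survive a PDF (or any failed request), one defensive pattern might look like the sketch below. The helper name fetch_page is just illustrative, and the content-type check assumes the server actually labels those responses as PDF; anything that fails or comes back as a PDF is skipped rather than aborting the rest of the run.

# sketch: fetch each URL defensively, skipping errors and PDF responses
fetch_page <- function(u) {
  resp <- tryCatch(GET(u), error = function(e) NULL)
  if (is.null(resp) || http_error(resp)) {
    message("error accessing ", u)
    return(NA_character_)
  }
  type <- headers(resp)$`content-type`
  if (grepl("pdf", type, ignore.case = TRUE)) {
    message("skipping PDF at ", u)
    return(NA_character_)
  }
  content(resp, as = "text")
}

pages <- lapply(urls, fetch_page)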
