tryCatch function works on most non-existent URLs, but it does not work in (at least) one case


Problem Description

Dear Stack Overflow users,

I am using R to scrape the profiles of a few psychotherapists from Psychology Today; this is done as an exercise to learn more about web scraping.

I am new to R and I have to go through this intense training, which will help me with a future project. This means I might not know precisely what I am doing at the moment (e.g. I might not interpret the script or R's error messages correctly), but I have to get it done. Therefore, I beg your pardon for possible misunderstandings or inaccuracies.

In short, the situation is the following. I have created a function that scrapes information from 2 nodes of each psychotherapist's profile; the function is shown in this Stack Overflow post.

Then I created a loop in which that function is applied to a few psychotherapists' profiles; the loop is in the post above as well, but I report it below because this is the part of the script that generates problems (in addition to the one I solved in the above-mentioned post).

j <- 1
MHP_codes <- c(150140:150180)  # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}

When the loop is done, I bind the information from the different profiles into one data frame and save it.

final_df <- rbind.fill(df_list)  # rbind.fill() comes from the plyr package
save(final_df, file = "final_df.Rda")

The function (getProfile) works well on individual profiles. It also works on a small range of profiles (c(150100:150150)). Please note that I do not know which psychotherapist IDs are actually assigned; so, many URLs within the range do not exist.

Generally speaking, however, tryCatch should handle this. When a URL is non-existent (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
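For reference, this is the behaviour I expect from the error handler; a minimal, self-contained sketch with a simulated failure standing in for the scraping call (so this is not my actual code):

## Any error raised inside tryCatch is replaced by the handler's return
## value, so a failed profile simply becomes an NA entry in df_list.
result <- tryCatch(stop("simulated scraping failure"),
                   error = function(e) NA)
result
#> [1] NA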

However, in some ID ranges, two problems can happen.

First, I get an error message such as the following one:

Error in open.connection(x, "rb") : HTTP error 404.

So, this happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).

Moreover, after the loop has stopped and R runs the line:

final_df <- rbind.fill(df_list)

a second error message appears:

Warning message: In df[[var]] : closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)

There seems to be a specific problem with that one empty URL. In fact, when I change the ID range, the loop works well despite the non-existent URLs: on one hand, when a URL exists, the information is scraped from the website; on the other hand, when a URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.

Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything? This solution would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it, nor whether it would actually solve my problem.

Is anyone able to help me sort out this issue?

Recommended Answer

Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so this is where it will throw an error (as opposed to returning an empty object) if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.

library(rvest)

## Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)

## Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
#> Error in open.connection(x, "rb") : HTTP error 404.

## Invalid URL: catch the error instead of letting it stop execution
URL <- "https://news.bbc.co.uk/not_exist"
page <- tryCatch(read_html(URL),
                 error = function(e) {
                   print("URL Not Found, skipping")
                   NULL  # return NULL so the caller can detect the failure
                 })
## Inside a loop, follow this with: if (is.null(page)) next
