Scrape with a loop and avoid 404 error


Problem description


I am trying to scrape Wikipedia for certain astronomy-related definitions for my project. The code works pretty well, but I am not able to avoid 404s. I tried tryCatch; I think I am missing something here.


I am looking for a way to overcome 404s while running a loop. Here is my code:

library(rvest)
library(httr)
library(XML)
library(tm)


topic<-c("Neutron star", "Black hole", "sagittarius A")

for(i in topic){

  site<- paste("https://en.wikipedia.org/wiki/", i)
  site <- read_html(site)

  stats<- xmlValue(getNodeSet(htmlParse(site),"//p")[[1]]) #only the first paragraph
  #error = function(e){NA}

  stats[["topic"]] <- i

  stats<- gsub('\\[.*?\\]', '', stats)
  #stats<-stats[!duplicated(stats),]
  #out.file <- data.frame(rbind(stats,F[i]))

  output<-rbind(stats,i)

}

Recommended answer


  1. Build the variable URLs in the loop using sprintf (see the short illustration further below).
  2. Extract all the body text from the paragraph nodes.
  3. Remove any vectors returning length(0).
  4. I added a step to include all of the body text, annotated by a prepended [paragraph - n] for reference... because, well, friends don't let friends waste data or make multiple HTTP requests.
  5. Build a data frame for each iteration of your topics list in the form below:
  6. Bind all of the data.frames in the list into one...

wiki_url: should be obvious

all_info: in case you need more... ya know.
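
For step 1, note that sprintf is vectorised, so the whole URL vector can be built in a single call rather than pasting inside the loop. A minimal illustration, reusing the topic vector from the question:

topic <- c("Neutron star", "Black hole", "sagittarius A")

# sprintf recycles the template over the whole vector in one call
sprintf("https://en.wikipedia.org/wiki/%s", topic)
# [1] "https://en.wikipedia.org/wiki/Neutron star"
# [2] "https://en.wikipedia.org/wiki/Black hole"
# [3] "https://en.wikipedia.org/wiki/sagittarius A"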

Note that I use an older, source version of rvest.

For ease of understanding, I'm simply assigning the name html to what would be your read_html.

   library(rvest)
   library(jsonlite)

   html <- rvest::read_html

   wiki_base <- "https://en.wikipedia.org/wiki/%s"

   my_table <- lapply(sprintf(wiki_base, topic), function(i){

        # parse the page once and pull the text of every paragraph node
        raw_1 <- html_text(html_nodes(html(i), "p"))

        # drop empty paragraphs
        raw_valid <- raw_1[nchar(raw_1) > 0]

        # prepend a [paragraph - n] tag to each paragraph and collapse into one string
        all_info <- lapply(seq_along(raw_valid), function(j){
            sprintf(' [paragraph - %d] %s ', j, raw_valid[[j]])
        }) %>% paste0(collapse = "")

        data.frame(wiki_url     = i,
                   topic        = basename(i),
                   info_summary = raw_valid[[1]],   # only the first paragraph
                   all_info     = trimws(all_info),
                   stringsAsFactors = FALSE)

    }) %>% rbind.pages

   > str(my_table)
   'data.frame':    3 obs. of  4 variables:
    $ wiki_url    : chr  "https://en.wikipedia.org/wiki/Neutron star"     "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
    $ topic       : chr  "Neutron star" "Black hole" "sagittarius A"
    $ info_summary: chr  "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
    $ all_info    : chr  " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__

EDIT

A function for error handling... it returns a logical, so this becomes our first step.

# HEAD() and status_code() come from httr, which is already loaded above
url_works <- function(url){
    tryCatch(
        # a HEAD request is enough; TRUE only when the server answers 200 OK
        identical(status_code(HEAD(url)), 200L),
        error = function(e){
            FALSE
        })
}
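
As a quick sanity check, the helper can be called on single URLs; the second page title below is made up purely to provoke a 404:

url_works("https://en.wikipedia.org/wiki/Neutron_star")          # TRUE:  the article exists
url_works("https://en.wikipedia.org/wiki/Some_made_up_article")  # FALSE: a made-up title, Wikipedia answers 404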

Based on your use of 'exoplanet', here is all of the applicable data from the wiki page:

exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'), '.wikitable') %>% html_table)[[2]]

str(exo_data)

    'data.frame':   2048 obs. of  16 variables:
 $ Name                          : chr  "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
 $ bf                            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Mass (Jupiter mass)           : num  0.004 0.0014 NA NA 0.1419 ...
 $ Radius (Jupiter radii)        : num  NA 0.054 0.114 0.071 1.012 ...
 $ Period (days)                 : num  11.186 0.177 4.195 6.356 19.224 ...
 $ Semi-major axis (AU)          : num  0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
 $ Ecc.                          : num  0.35 1.012 NA NA 0.0626 ...
 $ Inc. (deg)                    : num  NA 72 89.4 88.2 87.1 ...
 $ Temp. (K)                     : num  234 NA NA NA 707 ...
 $ Discovery method              : chr  "radial vel." "transit" "transit" "transit" ...
 $ Disc. Year                    : int  2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
 $ Distance (pc)                 : num  1.29 NA NA NA 650 ...
 $ Host star mass (solar masses) : num  0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
 $ Host star radius (solar radii): num  0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
 $ Host star temp. (K)           : num  3024 3584 3584 3584 5722 ...
 $ Remarks                       : chr  "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...

Test our url_works function on a random sample of the table:

tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name

Now let's build a reference table with the name, the URL to check, and a logical indicating whether the URL is valid, and in one step split it into a list of two data frames: one containing the URLs that don't exist and one containing those that do. The ones that check out we can run through the function above with no issues. This way the error handling is done before we actually start trying to parse in a loop, which avoids headaches and gives a reference back to the items that need to be looked into further.

library(plyr)  # for ldply()

b <- ldply(sprintf('https://en.wikipedia.org/wiki/%s', tests), function(i){
    data.frame(name = basename(i), url_checked = i, url_valid = url_works(i))
}) %>% split(.$url_valid)

> str(b)
List of 2
 $ FALSE:'data.frame':  24 obs. of  3 variables:
  ..$ name       : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
  ..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
  ..$ url_valid  : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRUE :'data.frame':  17 obs. of  3 variables:
  ..$ name       : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
  ..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
  ..$ url_valid  : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...

Obviously the second item of the list contains the data frame with valid URLs, so apply the earlier scraping function to the url column in that one. Note that I sampled the table of all planets for the purposes of explanation... there are 2,400-some-odd names, so in your case that check will take a minute or two to run. Hope that wraps it up for you.
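
A minimal sketch of that last step, assuming the per-page logic from my_table above is pulled out into a helper; the name scrape_wiki is hypothetical and does not appear in the code above:

# hypothetical helper: the same per-page logic used to build my_table above
scrape_wiki <- function(i){
    raw_1     <- html_text(html_nodes(html(i), "p"))
    raw_valid <- raw_1[nchar(raw_1) > 0]

    all_info <- lapply(seq_along(raw_valid), function(j){
        sprintf(' [paragraph - %d] %s ', j, raw_valid[[j]])
    }) %>% paste0(collapse = "")

    data.frame(wiki_url     = i,
               topic        = basename(i),
               info_summary = raw_valid[[1]],
               all_info     = trimws(all_info),
               stringsAsFactors = FALSE)
}

# b$`TRUE` holds only the rows whose URLs returned 200, so no 404s reach the parser
planet_table <- lapply(b$`TRUE`$url_checked, scrape_wiki) %>% rbind.pages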
