How can I web scrape without the problem of null websites in R?

Problem Description


I need to extract information about species, so I wrote the following code. However, I have a problem with some absent species. How can I avoid this problem?

# Load the required packages
Q <- c("rvest", "stringr", "tidyverse", "jsonlite")
lapply(Q, require, character.only = TRUE)

# These URLs were obtained by a pagination step that I have not included, to keep the code short
sp1 <- as.matrix(c(
  "https://www.gulfbase.org/species/Acanthilia-intermedia",
  "https://www.gulfbase.org/species/Achelous-floridanus",
  "https://www.gulfbase.org/species/Achelous-ordwayi",
  "https://www.gulfbase.org/species/Achelous-spinicarpus",
  "https://www.gulfbase.org/species/Achelous-spinimanus",
  "https://www.gulfbase.org/species/Agolambrus-agonus",
  "https://www.gulfbase.org/species/Agononida-longipes",
  "https://www.gulfbase.org/species/Amphithrax-aculeatus",
  "https://www.gulfbase.org/species/Anasimus-latus"
))

GiveMeData <- function(url) {
  page <- read_html(url)

  # Minimum depth
  selmin <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.figures--joined > div:nth-child(1)"
  mintext <- html_text(html_node(page, selmin))

  # Maximum depth
  selmax <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.figures--joined > div:nth-child(2)"
  maxtext <- html_text(html_node(page, selmax))

  # Distribution
  seldist <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div:nth-child(2) > div:nth-child(2) > div"
  distext <- html_text(html_node(page, seldist))

  # Habitat
  selhab <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div:nth-child(3) > ul"
  habtext <- html_text(html_node(page, selhab))

  # Microhabitat
  selhab2 <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.field > ul > li"
  habtext2 <- html_text(html_node(page, selhab2))

  # References
  selref <- "#block-beaker-content > article > div > main > section.node--full__related"
  reftext <- html_text(html_node(page, selref))

  # Strip the labels and layout whitespace from the raw text
  mintext <- gsub("\n                  \n      Min Depth\n      \n                            \n                      ", "", mintext)
  mintext <- gsub(" meters\n                  \n                    ", "", mintext)
  maxtext <- gsub("\n                  \n      Max Depth\n      \n                            \n                      ", "", maxtext)
  maxtext <- gsub(" meters\n                  \n", "", maxtext)
  habtext <- gsub("\n", ",", habtext)
  habtext <- gsub("\\s", "", habtext)
  reftext <- gsub("\n\n", ";", reftext)
  reftext <- gsub("\\s", "", reftext)

  # Return a two-row matrix: labels on top, scraped values below
  rbind(Info = c("Min", "Max", "Distribution", "Habitat", "MicroHabitat", "References"),
        Data = c(mintext, maxtext, distext, habtext, habtext2, reftext))
}

doit <- lapply(sp1, GiveMeData)


The problem is the absent species. I tried a small loop, but it did not work.
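For example (the URL below is hypothetical, assuming the absent species simply have no page on the site), a single bad URL makes read_html() throw an error that stops the whole lapply() call:

# Hypothetical missing page: read_html() errors out (e.g. an HTTP 404),
# so lapply() aborts before reaching the remaining URLs
GiveMeData("https://www.gulfbase.org/species/Some-missing-species")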

Recommended Answer


There might be ways to improve the GiveMeData function, but keeping the existing function as it is, we can use tryCatch to ignore any website that returns an error.

# tryCatch returns NULL (from the empty error handler) for any URL that errors
output <- lapply(c(sp1), function(x) tryCatch(GiveMeData(x), error = function(e) {}))
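If you then want to keep only the successfully scraped pages, one possible follow-up (an assumption about the desired output, not part of the original answer) is to name the results by species and drop the NULLs produced by the error handler:

# Label each result with its species slug and remove failed lookups
names(output) <- basename(c(sp1))          # e.g. "Acanthilia-intermedia"
output <- Filter(Negate(is.null), output)  # NULLs come from the empty error handler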
