Error: 'NA' does not exist in current working directory (Webscraping)


Problem description


I am trying to scrape data from the following URL: https://university.careers360.com/colleges/list-of-degree-colleges-in-India. I want to follow the link behind each college name and extract specific data for each college.

First, I collected all the college URLs in a vector:

# Load the packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)

# URL of the listing page to be scraped
baseurl <- "https://university.careers360.com/colleges/list-of-degree-colleges-in-India"

# Read the HTML content of the listing page
basewebpage <- read_html(baseurl)

# Extract each college name and its URL
scraplinks <- function(url) {
  # Create an HTML document from the URL
  webpage <- xml2::read_html(url)
  # Extract the link targets
  url_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_text()
  # tibble() replaces the deprecated data_frame()
  return(tibble(link = link_, url = url_))
}

# College names and URLs
allcollegeurls <- scraplinks(baseurl)

This works fine so far, but when I call read_html on each URL, it throws an error.

# Read each URL
for (i in allcollegeurls$url) {
  clgwebpage <- read_html(allcollegeurls$url[i])
}

Error: 'NA' does not exist in current working directory ('C:/Users/User/Documents').

I even tried adding a break statement, but I still get the same error:

# Read each URL
for (i in allcollegeurls$url) {
  clgwebpage <- read_html(allcollegeurls$url[i])
  if(is.na(allcollegeurls$url[i]))break
}

Please help.

Posting the str output of allcollegeurls, as requested:

> str(allcollegeurls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   30 obs. of  2 variables:
 $ link: chr  "Netaji Subhas Institute of Technology, Delhi" "Hansraj College, Delhi" "School of Business, University of Petroleum and Energy Studies, D.." "Hindu College, Delhi" ...
 $ url : chr  "https://www.careers360.com/university/netaji-subhas-university-of-technology-new-delhi" "https://www.careers360.com/colleges/hansraj-college-delhi" "https://www.careers360.com/colleges/school-of-business-university-of-petroleum-and-energy-studies-dehradun" "https://www.careers360.com/colleges/hindu-college-delhi" ...

Solution

This works:

purrr::map(allcollegeurls$url, read_html)
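
A minimal sketch of the same call that can be tried offline. xml2::read_html also accepts literal HTML strings, so the two snippets below stand in for the college pages. The page contents and the names pages, docs and titles are made up for illustration; with the real data you would pass allcollegeurls$url instead.

```r
library(purrr)
library(xml2)

# Hypothetical stand-ins for the downloaded college pages (literal HTML,
# so no network access is needed to try this out).
pages <- c(
  "<html><body><h1>Hansraj College</h1></body></html>",
  "<html><body><h1>Hindu College</h1></body></html>"
)

# map() calls read_html() once per element and returns a list of documents.
docs <- map(pages, read_html)

# Pull one value out of each parsed document; map_chr() returns a
# character vector instead of a list.
titles <- map_chr(docs, function(doc) xml_text(xml_find_first(doc, "//h1")))
print(titles)
```

Because every parsed page is kept in docs, nothing is overwritten between iterations, and later extraction steps can be mapped over the same list.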

About map: the map functions transform their input by applying a function to each element and returning a vector of the same length as the input (map itself returns a list). I prefer to avoid for loops in R.
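
The loop in the question can also be repaired directly. for (i in allcollegeurls$url) sets i to each URL string, not to a position, so allcollegeurls$url[i] looks up an element named by that string. The vector has no names, so the lookup returns NA, which read_html then treats as a file path. The sketch below reproduces this with a placeholder vector (urls and the example.com addresses are assumptions, not the real data) and shows two ways to write the loop.

```r
# Hypothetical stand-in for allcollegeurls$url; the addresses are placeholders.
urls <- c("https://example.com/college-a", "https://example.com/college-b")

# Buggy pattern from the question: `i` is the URL string itself, so urls[i]
# is a lookup by name on an unnamed vector and yields NA every time.
buggy <- character(0)
for (i in urls) {
  buggy <- c(buggy, urls[i])
}
print(is.na(buggy))

# Fix 1: iterate over positions and index with them.
by_index <- character(length(urls))
for (i in seq_along(urls)) {
  by_index[i] <- urls[i]
}

# Fix 2: use the loop variable directly, since it already holds the URL.
direct <- character(0)
for (u in urls) {
  direct <- c(direct, u)
}
```

Note that the original loop also overwrites clgwebpage on every pass, so only the last page would survive. Storing each result in a list, or using map as above, avoids that.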
