How to scrape these links with follow_link in R?

Question

I'm learning how to do web scraping with R. In this case I'm using the package "rvest" and, in particular, a function called follow_link.

The idea is to get the information from a webpage that links to several department pages. I want my code to follow each of those links and grab the table on each page.

Here is the code:

library(rvest)

# Start a session on the faculty page
s <- html_session("http://fccee.uvigo.es/es/profesorado.html")

# Link text of the department pages to visit
link <- c("Dereito Privado", "Economia Financieira e Contabilidade", "Matemáticas",
          "Estadística e Investigación Operativa", "Economía Aplicada",
          "Fundamentos da Análise Ec. e Hª e Institucións Económicas",
          "Informática", "Organización de Empresas e Marketing",
          "Socioloxía, Ciencia Política e da Administración e Filosofía")

n <- length(link)  # number of pages
datos <- list()
for (i in 1:n) {
  s <- s %>% follow_link(link[i])                                   # open the department page
  datos[[i]] <- s %>% html_nodes(".lista_fccee") %>% html_table()   # scrape the table(s)
  s <- s %>% back()                                                 # go back to the index
}

The problem is that I get this error: No links have text 'Matemáticas'. I believe the problem is related to the accent mark, because the first two links go through with no problem.

This may be a very basic question, but I didn't find any info on this particular error.

Thanks in advance!

Answer

The problem is, as you suspect, with the special character (the accented á). You can see how R views the link names with this code:

library(rvest)
top_url <- "http://fccee.uvigo.es/es/profesorado.html"
page <- read_html(top_url)
links <- page %>% html_nodes("a") %>% html_text()
links
#> ...
#> [44] "Matemáticas"
#> ...
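
One plausible cause (an assumption, not verified against this page) is Unicode normalization: the page may encode the accented á in decomposed form (NFD), while the string literal in the script is composed (NFC), so an exact text comparison fails even though the two render identically. A minimal sketch of how to check this, assuming the stringi package is installed:

library(rvest)
library(stringi)  # assumption: stringi is available for Unicode normalization

page <- read_html("http://fccee.uvigo.es/es/profesorado.html")
link_text <- page %>% html_nodes("a") %>% html_text()

target <- "Matemáticas"
any(link_text == target)                                  # may be FALSE if the forms differ
any(stri_trans_nfc(link_text) == stri_trans_nfc(target))  # compare after NFC-normalizing both sides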

Either way, this ends up being a complicated encoding issue that I couldn't figure out how to resolve. So, instead, here's an alternate way to get your data.

library(rvest)

top_url <- "http://fccee.uvigo.es/es/profesorado.html"
page <- read_html(top_url)

# Pull the href of every department link instead of matching on link text
links <- page %>%
  html_nodes(".listado_fccee li a") %>%
  html_attr("href")

datos <- list()
for (i in links) {
  datos[[length(datos) + 1]] <- i %>%
    paste0("http://fccee.uvigo.es", .) %>%   # hrefs are relative, so prepend the domain
    read_html() %>%
    html_nodes(".lista_fccee") %>%
    html_table()
}

Instead of using a session, you read in the first page and extract every href from the listado_fccee element that holds the department links. You then read each link and fetch the table as you did before, adding it to your list.
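
If you then want the results labelled, or stacked into a single table, here is a hypothetical follow-up (it assumes every department page yields at least one table and that all the tables share the same columns; the variable names are illustrative):

names(datos) <- links                          # label each entry by its href
tablas <- lapply(datos, function(x) x[[1]])    # html_table() returns a list of tables; take the first
profesorado <- do.call(rbind, tablas)          # stack everything into one data frame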
