How to scrape these links with follow_link in R?
Problem Description
I'm learning how to do web scraping with R. In this case I'm using the package "rvest" and a particular function called follow_link.
The idea is to get the information from a webpage that has multiple links. I want my code to enter those links and get the table in each of them.
Here is the code:
library(rvest)

s <- html_session("http://fccee.uvigo.es/es/profesorado.html")
link <- c("Dereito Privado", "Economia Financieira e Contabilidade", "Matemáticas",
          "Estadística e Investigación Operativa", "Economía Aplicada",
          "Fundamentos da Análise Ec. e Hª e Institucións Económicas",
          "Informática", "Organización de Empresas e Marketing",
          "Socioloxía, Ciencia Política e da Administración e Filosofía")
n <- length(link)  # number of pages
datos <- list()
for (i in 1:n) {
  s <- s %>% follow_link(link[i])
  datos[[i]] <- s %>% html_nodes(".lista_fccee") %>% html_table()
  s <- s %>% back()
}
The problem is that I get this error: No links have text 'Matemáticas'. I believe the problem is related to the accent mark in the text, because the first two links go through with no problem.
This may be a very basic question, but I couldn't find any info on this particular error.
Thanks in advance!
Recommended Answer
The problem is, as you suspect, with the special character (the accented "a"). You can see how R views the link names with this code:
library(rvest)
top_url = "http://fccee.uvigo.es/es/profesorado.html"
page = read_html(top_url)
links = page %>% html_nodes("a") %>% html_text()
links
#> ...
#> [44] "Matemáticas"
#> ...
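One plausible explanation for the mismatch (an assumption on my part, not something the original answer verified) is Unicode normalization: two strings can print identically as "Matemáticas" yet differ byte-for-byte when one stores "á" as a single precomposed code point (NFC) and the other as "a" plus a combining accent (NFD). Since follow_link matches the link text exactly, such a difference would produce exactly the "No links have text" error. A minimal base-R illustration:

```r
# Hypothetical illustration of the suspected cause: the same visible word
# built from two different Unicode representations.
precomposed <- "Matem\u00e1ticas"   # "á" as one code point (NFC)
decomposed  <- "Matema\u0301ticas"  # "a" + combining acute accent (NFD)

precomposed == decomposed  # FALSE, although both print as "Matemáticas"
charToRaw(precomposed)     # the "á" is stored as bytes c3 a1
charToRaw(decomposed)      # the "a" + accent is stored as bytes 61 cc 81
```

If the script's literal and the page's text use different forms, an exact text match will fail even though the two look identical on screen.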
This ends up being a complicated encoding issue which I can't figure out how to deal with. So, instead, here's an alternate way to get your data.
library(rvest)

top_url = "http://fccee.uvigo.es/es/profesorado.html"
page = read_html(top_url)
links = page %>%
  html_nodes(".listado_fccee li a") %>%
  html_attr("href")

datos <- list()
for (i in links) {
  datos[[length(datos) + 1]] <- i %>%
    paste0("http://fccee.uvigo.es", .) %>%
    read_html() %>%
    html_nodes(".lista_fccee") %>%
    html_table()
}
Instead of using a session, you read in the first page and extract all the links from the div with class listado_fccee, which holds the department links. You then read each link and fetch the table as you did before, adding the results to your list.
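One optional extension (an assumption on my part, not part of the original answer): since the loop above indexes results only by position, you can keep track of which table came from which department by naming the list elements after the link text:

```r
# Optional: label each result with its department name. Assumes `page` and
# `datos` from the code above; the selector mirrors the one used for hrefs,
# so the names line up with the loop's order.
names(datos) <- page %>%
  html_nodes(".listado_fccee li a") %>%
  html_text()
```

Because html_text() is extracted in the same document order as html_attr("href"), the names match the tables one-to-one.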