how to scrape all pages (1,2,3,.....n) from a website using rvest
This article describes how to scrape all pages (1,2,3,.....n) from a website using rvest; it may be a useful reference for anyone solving a similar problem.
Problem description
# I would like to read the list of .html files to extract data. Appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- "C:/R/BNB/"
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# read each page and write it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page=/", i, sep = "")
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object from the downloaded file
  html <- html(paste(download_folder, i, ".html", sep = ""))
}
Recommended answer
Here is a potential solution:
library(rvest)
library(stringr)

u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd() # note change in output directory

# the highest numbered "page-numbers" link gives the total page count
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)

# read each page and write it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  # file.path() supplies the "/" missing between the folder and file name
  download.file(url, file.path(download_folder, paste0(i, ".html")))
  # create html object from the downloaded file
  html <- read_html(file.path(download_folder, paste0(i, ".html")))
}
I could not find the .results_count class in the html, so instead I looked for the page-numbers class and picked the highest returned value.
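To see why na.rm = TRUE matters here, note that pagination links typically include a non-numeric label such as "Next" alongside the page numbers, and coercing that label to an integer yields NA. A minimal sketch with made-up link labels:

# hypothetical labels scraped from the "a.page-numbers" links
labels <- c("2", "3", "12", "Next")
# "Next" cannot be parsed, so as.integer() returns NA for it (with a
# coercion warning); na.rm = TRUE drops the NA so max() still finds
# the highest page number
max(as.integer(labels), na.rm = TRUE) # returns 12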
Also, the function html() is deprecated, so I replaced it with read_html(). Good luck!
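One more note: the loop above overwrites html on every iteration, so only the last page is left in memory. To extract data from all of the saved files (the original goal), something along these lines could follow the loop; the .job_title selector is purely an assumption for illustration, so inspect the real page and substitute a selector that matches the data you want:

# read each saved page back in and collect the matching text
# ".job_title" is a hypothetical selector; replace it with one that
# actually exists on the pages you downloaded
all_items <- unlist(lapply(1:TP, function(i) {
  page <- read_html(file.path(download_folder, paste0(i, ".html")))
  html_text(html_nodes(page, ".job_title"))
}))
head(all_items)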