How to loop through multiple websites using read_html in R?
Problem description
I'm having trouble creating a loop that calls read_html and extracts the information I need. I was able to write a loop that extracts from a single website.
For example, below is my code to extract the title, description, and keywords from the Amazon website.
# Load packages before calling any of their functions
library(rvest)
library(xml2)   # xml_contents() comes from xml2; load it explicitly in case rvest does not attach it
library(dplyr)

URL <- read_html("http://www.amazon.com")
results <- URL %>% html_nodes("head")
records <- vector("list", length = length(results))
for (i in seq_along(records)) {
  title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
  description <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
  keywords <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  records[[i]] <- data.frame(title = title, description = description, keywords = keywords)
}
However, if I have:
name <- c("amazon", "apple", "usps")
# URLs listed in the same order as the names so each row pairs correctly
url <- c("http://www.amazon.com",
         "http://www.apple.com",
         "http://www.usps.com")
webpages <- data.frame(name, url)
How can I include read_html in the existing loop so that it extracts the information I want for each site and also includes the URL name?
Example of desired output
url                     title    description                    keywords
http://www.apple.com    Apple    Apple's website description    Apple, iPhone, iPad
http://www.amazon.com   Amazon   Amazon's website description   Shopping, Home, Online
http://www.usps.com     USPS     USPS's website description     Shipping, Postage, Stamps
Thanks for any advice.
Recommended answer
Something like this may work for you.
library(rvest)
library(xml2)   # xml_contents() comes from xml2; load it explicitly in case rvest does not attach it
library(dplyr)

webpages <- data.frame(name = c("amazon", "apple", "usps"),
                       url = c("http://www.amazon.com",
                               "http://www.apple.com",
                               "http://www.usps.com"))

# apply() passes each row as a named character vector, so x['url'] is the URL string
webpages <- apply(webpages, 1, function(x){
  URL <- read_html(x['url'])
  results <- URL %>% html_nodes("head")
  records <- vector("list", length = length(results))
  for (i in seq_along(records)) {
    title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
    desc <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
    kw <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  }
  # Fall back to NA when a page lacks the tag, so every row has the same columns
  return(data.frame(name = x['name'],
                    url = x['url'],
                    title = ifelse(length(title) > 0, title, NA),
                    description = ifelse(length(desc) > 0, desc, NA),
                    keywords = ifelse(length(kw) > 0, kw, NA)))
})

# Stack the per-site one-row data frames into a single table
webpages <- do.call(rbind, webpages)
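As a possible variation on the approach above, the extraction step can be pulled into a small function so it can be checked on an inline HTML string before any site is requested. This is only a sketch: the helper names extract_meta and scrape_one are hypothetical and not part of the original answer. It assumes rvest 1.0.0 or later, where html_element() returns a missing node (and therefore NA) when a tag is absent, and it wraps read_html in tryCatch so that one unreachable site does not stop the whole run.

```r
library(rvest)

# Hypothetical helper: pull the title and two meta tags from a parsed page.
# html_element() yields a missing node when nothing matches, so absent tags
# come back as NA rather than zero-length vectors.
extract_meta <- function(page) {
  data.frame(
    title       = html_text(html_element(page, "head title"), trim = TRUE),
    description = html_attr(html_element(page, "meta[name=description]"), "content"),
    keywords    = html_attr(html_element(page, "meta[name=keywords]"), "content"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: fetch one URL and return a one-row data frame.
# If the request fails, the metadata columns are filled with NA.
scrape_one <- function(name, url) {
  page <- tryCatch(read_html(url), error = function(e) NULL)
  meta <- if (is.null(page)) {
    data.frame(title = NA_character_, description = NA_character_,
               keywords = NA_character_, stringsAsFactors = FALSE)
  } else {
    extract_meta(page)
  }
  cbind(data.frame(name = name, url = url, stringsAsFactors = FALSE), meta)
}

# Offline check on a literal HTML string (no network needed).
demo_page <- read_html(
  '<html><head><title> Demo </title>
   <meta name="description" content="A demo page"></head><body></body></html>'
)
demo <- extract_meta(demo_page)

# Against the real sites, something like this could be used (not run here):
# webpages <- data.frame(name = c("amazon", "apple", "usps"),
#                        url  = c("http://www.amazon.com",
#                                 "http://www.apple.com",
#                                 "http://www.usps.com"),
#                        stringsAsFactors = FALSE)
# results <- do.call(rbind, Map(scrape_one, webpages$name, webpages$url))
```

Because each call returns a data frame with the same five columns, do.call(rbind, ...) can stack the rows directly, with no need for the length checks used with html_nodes.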