How to loop through multiple websites using read_html in R?

Question

I'm having trouble writing a loop that calls read_html on several websites and extracts the information I need. So far I have a loop that works for a single website.

For example, below is my code to extract the title, description, and keywords from the Amazon website.

library(rvest)
library(xml2)   # provides xml_contents()
library(dplyr)

URL <- read_html("http://www.amazon.com")
results <- URL %>% html_nodes("head")

records <- vector("list", length = length(results))

for (i in seq_along(records)) {
  title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
  description <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
  keywords <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  records[[i]] <- data.frame(title = title, description = description, keywords = keywords)
}

However, if I have:

name <- c("apple", "amazon", "usps")
url <- c("http://www.apple.com",
         "http://www.amazon.com",
         "http://www.usps.com")
webpages <- data.frame(name, url)

How can I include read_html in the loop I already have, so that it extracts the information I want for each site and also includes the URL?

Example of desired output

url                      title            description               keywords
http://www.apple.com     Apple    Apple's website description     Apple, iPhone, iPad
http://www.amazon.com    Amazon   Amazon's website description    Shopping, Home, Online
http://www.usps.com      USPS     USPS's website description      Shipping, Postage, Stamps

Thanks for any suggestions.

Recommended answer

Something like this may work for you.

library(rvest)
library(xml2)   # provides xml_contents()
library(dplyr)

webpages <- data.frame(name = c("amazon", "apple", "usps"),
                        url = c("http://www.amazon.com",
                                "http://www.apple.com",
                                "http://www.usps.com"))

# apply() passes each row as a named character vector, so x['url'] is that row's URL
webpages <- apply(webpages, 1, function(x){
  URL <- read_html(x['url'])

  results <- URL %>% html_nodes("head")

  records <- vector("list", length = length(results))

  for (i in seq_along(records)) {
    title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
    desc <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
    kw <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  }

  # Fall back to NA when a page lacks a tag, so every row has the same columns
  return(data.frame(name = x['name'],
                    url = x['url'],
                    title = ifelse(length(title) > 0, title, NA),
                    description = ifelse(length(desc) > 0, desc, NA),
                    keywords = ifelse(length(kw) > 0, kw, NA)))
})

# Stack the per-site data frames into a single data frame
webpages <- do.call(rbind, webpages)
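
As a variation on the same idea, the per-page extraction can be pulled out into a small function and the loop written with lapply, so that one unreachable site does not abort the whole run. This is only a sketch: extract_meta, scrape_meta, and first_or_na are hypothetical helper names not in the original answer, and the demo parses an inline HTML string instead of a live site so it runs offline.

```r
library(rvest)

# Hypothetical helper: pull title, description, and keywords from one parsed page.
extract_meta <- function(page) {
  # Return the first match, or NA when the page has no such element.
  first_or_na <- function(x) if (length(x) > 0) x[[1]] else NA_character_

  data.frame(
    title       = first_or_na(page %>% html_nodes("title") %>% html_text(trim = TRUE)),
    description = first_or_na(page %>% html_nodes("meta[name=description]") %>% html_attr("content")),
    keywords    = first_or_na(page %>% html_nodes("meta[name=keywords]") %>% html_attr("content")),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: read each source in turn; if reading fails, keep an all-NA row.
scrape_meta <- function(sources) {
  rows <- lapply(sources, function(src) {
    meta <- tryCatch(
      extract_meta(read_html(src)),
      error = function(e) {
        data.frame(title = NA_character_, description = NA_character_,
                   keywords = NA_character_, stringsAsFactors = FALSE)
      }
    )
    data.frame(url = src, meta, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Offline demo: read_html() also accepts literal markup, not only URLs.
demo_page <- read_html(
  '<html><head><title> Demo </title><meta name="description" content="A demo page"></head><body></body></html>'
)
demo <- extract_meta(demo_page)
print(demo)
```

With the data frame from the answer, the call would be along the lines of scrape_meta(as.character(webpages$url)); the as.character() guards against url having been stored as a factor, which is the data.frame default in R versions before 4.0.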
