Using rvest to scrape data that is not in a table


Problem description

I'm trying to scrape some data from a website. I thought I could use rvest, but I'm having a lot of trouble getting data that is not in a table.

I don't know if it's possible, or whether I'm using the wrong package?

I am trying to get the website, name, and address from the following HTML:

<div class="info clearfix">
  <i class="sprite icon title"></i>
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Tennis_Court.html">
    Tennis Court</a>
  </p>
  <p class="location"> 123 Page St, Charlestown</p>
  <p class="excerpt" itemprop="description">A place to play tennis</p>
</div>

I'd hoped I could use something like html_node("title"), but that doesn't seem to work. Am I completely on the wrong path?

Solution

You can pass CSS selectors to html_nodes to extract the elements you need:

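library(rvest)
url <- 'https://concreteplayground.com/auckland/bars'

# Download and parse the page once, then query it with CSS selectors
webpage <- url %>% read_html()
# Text of each link inside <p class="name">
name <- webpage %>% html_nodes('p.name a') %>% html_text() %>% trimws()
# Text of each <p class="address">
address <- webpage %>% html_nodes('p.address') %>% html_text() %>% trimws()
# href attribute of the same links used for the names
links <- webpage %>% html_nodes('p.name a') %>% html_attr('href')
data.frame(name, address, links)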

#       name                        address
#1  Holy Hop  498 New North Road, Kingsland
#2       Sly  354A Karangahape Road, Newton
#...
#...

#                                                   links
#1 https://concreteplayground.com/auckland/bars/holy-hop
#2      https://concreteplayground.com/auckland/bars/sly
#...
#...
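
The answer above runs against a live page whose class names (name, address) differ from those in the question's snippet. As a rough sketch of how the same approach might carry over to the HTML fragment from the question, the example below parses that fragment from a string and uses the selectors p.title a and p.location, which are inferred from the snippet rather than taken from the answer. The variable names page, website, and result are illustrative only.

```r
library(rvest)

# The HTML fragment from the question, parsed from a string instead of a URL.
html <- '<div class="info clearfix">
  <i class="sprite icon title"></i>
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Tennis_Court.html">
    Tennis Court</a>
  </p>
  <p class="location"> 123 Page St, Charlestown</p>
  <p class="excerpt" itemprop="description">A place to play tennis</p>
</div>'

page <- read_html(html)

# "p.title a" selects <a> elements inside a <p> whose class is "title".
# A bare "title" selector looks for a <title> tag, not for class="title".
name    <- page %>% html_nodes("p.title a") %>% html_text() %>% trimws()
website <- page %>% html_nodes("p.title a") %>% html_attr("href")

# "p.location" selects the <p> whose class is "location".
address <- page %>% html_nodes("p.location") %>% html_text() %>% trimws()

result <- data.frame(name, address, website, stringsAsFactors = FALSE)
print(result)
```

The p prefix keeps the selector from also matching the <i class="sprite icon title"> element. html_nodes is used here to stay consistent with the answer; rvest 1.0.0 and later also provide html_elements, which takes the same selectors.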
