Reading in HTML with R rvest. How do I check if a CSS selector class contains anything?


Question

This is my first attempt to deal with HTML and CSS selectors. I am using the R package rvest to scrape the Billboard Hot 100 website. Some of the data that I am interested in include this week's rank, the song, whether or not the song is new, and whether or not the song has any awards.

I am able to get the song name and rank with the following:

library(rvest)
URL <- "http://www.billboard.com/charts/hot-100/2017-09-30"

webpage <- read_html(URL)
current_week_rank <- html_nodes(webpage, '.chart-row__current-week')
current_week_rank <- as.numeric(html_text(current_week_rank))

My problem comes with the new and award indicators. The songs are listed in rows, with each of the 100 contained in an element like:

<article class="chart-row char-row--1 js chart-row" ....>
</article>

If a song is new, its row will contain an element with this class:

<div class="chart-row__new-indicator">

If a song has an award, its row will contain an element with this class:

<div class="chart-row__award-indicator">

Is there a way that I can look at all 100 instances of class="chart-row char-row--1 js chart-row" ... and see whether either of these exists within each one? The output that I get from current_week_rank is one column of 100 values. I am hoping there is a way to get the indicators in the same shape, so that I have one observation for each song.

Thanks for any help or advice.

Recommended answer

This basically amounts to a tailored version of the Q&A I indicated above. I can't tell with 100% certainty whether the or is working as intended, since there's only one row in your example page with a <div class="chart-row__new-indicator">, and that row also happens to have a <div class="chart-row__award-indicator"> tag.

# XPath to focus on the 100 rows of interest
primary_xp <- '//div[@class="chart-row__primary"]'
# XPath which subselects the rows you're after
check_xp <- paste('div[@class="chart-row__award-indicator" or',
                  '@class="chart-row__new-indicator"]')

webpage %>% html_nodes(xpath = primary_xp) %>%
  # chart-row__primary nodes for which there are no such child nodes
  #   will come back NA, and hence so will html_attr('class')
  html_node(xpath = check_xp) %>%
  # ! is a bit extraneous, as it only flips FALSE to TRUE
  #   for the rows you're after (necessity depends on
  #   particulars of your application)
  html_attr('class') %>% is.na %>% `!`
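
Since the question asks about the new and award indicators separately, the same per-row idea can be split into one logical column per indicator. The following is only a minimal sketch under stated assumptions: `sample_html` is made-up markup that imitates three chart rows (it is not copied from the live Billboard page), and `has_child` is a hypothetical helper name. It relies on `html_node()` (singular) returning exactly one result per row, with a missing node where nothing matches.

```r
library(rvest)

# Hypothetical markup imitating three chart rows; NOT the real Billboard page
sample_html <- '
<div class="chart-row__primary">
  <span class="chart-row__current-week">1</span>
  <div class="chart-row__new-indicator"></div>
  <div class="chart-row__award-indicator"></div>
</div>
<div class="chart-row__primary">
  <span class="chart-row__current-week">2</span>
</div>
<div class="chart-row__primary">
  <span class="chart-row__current-week">3</span>
  <div class="chart-row__award-indicator"></div>
</div>'

page <- read_html(sample_html)
rows <- html_nodes(page, ".chart-row__primary")

# Hypothetical helper: TRUE for each row that contains a node matching `css`.
# html_node() yields a missing node for rows without a match, and
# html_attr() on a missing node is NA.
has_child <- function(nodes, css) {
  !is.na(html_attr(html_node(nodes, css), "class"))
}

chart <- data.frame(
  rank      = as.numeric(html_text(html_node(rows, ".chart-row__current-week"))),
  is_new    = has_child(rows, ".chart-row__new-indicator"),
  has_award = has_child(rows, ".chart-row__award-indicator")
)
print(chart)
```

Because every column is derived from the same `rows` node set, each vector has one entry per song and the columns stay aligned even when an indicator is absent.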

FWIW, you may be able to shorten check_xp to the following:

check_xp = 'div[contains(@class, "indicator")]'

This certainly covers both "chart-row__award-indicator" and "chart-row__new-indicator", but it would also match any other nodes whose class contains "indicator", if such tags exist (you'll have to determine this for yourself).
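
The trade-off between the two XPath expressions can be checked on a small test page. This is a rough sketch, not a statement about the real site: the markup is invented, and in particular the class "chart-row__trend-indicator" is a hypothetical third indicator added only to show the over-matching; `flag_rows` is likewise a made-up helper name.

```r
library(rvest)

# Hypothetical markup; "chart-row__trend-indicator" is an invented class
sample_html <- '
<div class="chart-row__primary"><div class="chart-row__new-indicator"></div></div>
<div class="chart-row__primary"><div class="chart-row__award-indicator"></div></div>
<div class="chart-row__primary"><div class="chart-row__trend-indicator"></div></div>
<div class="chart-row__primary"></div>'

rows <- html_nodes(read_html(sample_html),
                   xpath = '//div[@class="chart-row__primary"]')

# Hypothetical helper: TRUE for each row with a child matching the XPath `xp`
flag_rows <- function(xp) {
  !is.na(html_attr(html_node(rows, xpath = xp), "class"))
}

# Broad test: any child div whose class contains "indicator"
broad <- flag_rows('div[contains(@class, "indicator")]')

# Exact test: only the two classes named in the question
exact <- flag_rows(paste('div[@class="chart-row__award-indicator" or',
                         '@class="chart-row__new-indicator"]'))

print(data.frame(broad = broad, exact = exact))
```

On this sample the two versions differ only for the third row, so comparing them on the real page is a quick way to decide whether the shorter expression is safe to use.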
