rvest missing nodes --> NA

Question

I'm trying to search for nodes in an HTML document using rvest in R. In the code below, I would like to know how to return NULL or NA when "s_BadgeTop*" is missing. This is for academic purposes only.

<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por&nbsp;</div><div style="float:left;"><a href="/gp/pdp/profile/XXX" ><span style = "font-weight: bold;">JOHN</span></a> (UK)  - <a href="/gp/cdp/member-reviews/XXX">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop1000 " ><span>(TOP 1000 COMENTARISTAS)</span></span></div></div></div>

<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por&nbsp;</div><div style="float:left;"><a href="/gp/pdp/profile/YYY" ><span style = "font-weight: bold;">MARY</span></a> (USA)  - <a href="/gp/cdp/member-reviews/YYY">Ver todas las opiniones</a><br /></div></div></div>

<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por&nbsp;</div><div style="float:left;"><a href="/gp/pdp/profile/ZZZ" ><span style = "font-weight: bold;">CANDICE</span></a> (UK)  - <a href="/gp/cdp/member-reviews/ZZZ">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop500 " ><span>(TOP 500 COMENTARISTAS)</span></span></div></div></div>

I need a data.frame with this structure:

  1. JOHN (TOP 1000 COMENTARISTAS)
  2. MARY
  3. CANDICE (TOP 500 COMENTARISTAS)

I have tried this code:

name <- pg %>%
  html_nodes(xpath='//a[contains(@href,"/gp/pdp/profile/")]') %>%
  html_text()

status <- pg %>%
  html_nodes(xpath='//span[contains(@class,"cmtySprite s_BadgeTop")]') %>%
  html_text()
status[is.na(status)] <- "NA"

But status[is.na(status)] <- "NA" does not work.

I get the following output:

  1. JOHN (TOP 1000 COMENTARISTAS)
  2. MARY (TOP 500 COMENTARISTAS)
  3. CANDICE (TOP 1000 COMENTARISTAS)

Thanks!

Recommended answer

You can iterate over each of the three entries, extract the name and, where present, the badge from each one, and then merge the results. Because each entry is parsed on its own, a missing badge yields NA for that entry instead of shifting the remaining badges up.

Example:

# rvest for the html_* functions and %>%; data.table for rbindlist
library(rvest)
library(data.table)

# Function to parse a particular 'div' and extract name and (potentially) badge
parse_node <- function(node) {
  name <- node %>% 
    html_node('a[href^="/gp/pdp/profile"]') %>%
    html_text
  badge <- node %>%
    html_nodes('span[class*="s_BadgeTop"] span') %>%
    html_text
  list(name=name[1],badge=badge[1])
}

# extract nodes, parse and merge
pg %>%  
  html_nodes('div[style^="margin-bottom"] div div[style^=float]:nth-child(2)') %>%
  lapply(parse_node) %>%
  rbindlist
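
As a variation on the same idea, here is a minimal self-contained sketch that relies on html_node (singular) returning one result per input node, with NA where nothing matches. The HTML string is a simplified stand-in for the snippets in the question, and the names html, entries and result are illustrative assumptions, not part of the original answer.

```r
library(rvest)

# Simplified stand-in for the three entries in the question (assumed markup)
html <- paste0(
  '<div style="margin-bottom:0.5em;"><a href="/gp/pdp/profile/XXX">JOHN</a>',
  '<span class="cmtySprite s_BadgeTop1000 "><span>(TOP 1000 COMENTARISTAS)</span></span></div>',
  '<div style="margin-bottom:0.5em;"><a href="/gp/pdp/profile/YYY">MARY</a></div>',
  '<div style="margin-bottom:0.5em;"><a href="/gp/pdp/profile/ZZZ">CANDICE</a>',
  '<span class="cmtySprite s_BadgeTop500 "><span>(TOP 500 COMENTARISTAS)</span></span></div>'
)
pg <- read_html(html)

# One node per reviewer entry
entries <- pg %>% html_nodes('div[style^="margin-bottom"]')

# html_node() keeps one slot per entry, so a missing badge becomes NA
result <- data.frame(
  name  = entries %>% html_node('a[href^="/gp/pdp/profile"]') %>% html_text(),
  badge = entries %>% html_node('span[class*="s_BadgeTop"] span') %>% html_text(),
  stringsAsFactors = FALSE
)
print(result)
```

This also suggests why status[is.na(status)] <- "NA" had no effect in the question: html_nodes on the whole page returns only the badges that exist, so status has two elements and contains no NA to replace.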
