Different number of nodes

Views: 31 | Posted: 2021/9/24 18:57:20 | Tags: r, web-scraping


Problem description

I want to get some airline reviews from airlinequality.com, where information about different aspects of a flight is available. When writing a flight review, not all fields are mandatory. This produces a structure in which different reviews have different numbers of elements, which my current code can't handle.

For example, I want to get reviews from this page: http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/

There are 10 reviews with a Seat Comfort rating, but Inflight Entertainment is available in only 8 of them. In the end, this creates two vectors of different lengths, which can't be merged.

My code:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_replace_all()

review_html_temp = read_html("http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/")

# last filled star in the Seat Comfort row of every review table
review_seat_comfort = review_html_temp %>%
  html_nodes(xpath = './/table[@class = "review-ratings"]//td[@class = "review-rating-header seat_comfort"]/following-sibling::td/span[@class = "star fill"][last()]') %>%
  html_text() %>%
  str_replace_all(pattern = "[\r\n\t]" , "")

# same for Inflight Entertainment; reviews without this row yield no node at all
review_entertainment = review_html_temp %>%
  html_nodes(xpath = './/table[@class = "review-ratings"]//td[@class = "review-rating-header inflight_entertainment"]/following-sibling::td//span[@class = "star fill"][last()]') %>%
  html_text() %>%
  str_replace_all(pattern = "[\r\n\t]" , "")

Is there a way to fill the entertainment value with " " or NA when the node is not present, so that I get a value for all 10 reviews? The final result would look like:

seat_comfort: "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"
entertainment_system: "5" "1" NA "1" "1" "3" NA "3" "5" "1"

Recommended answer

The key is that html_nodes(...) %>% html_node(...) returns one entry for each node returned by html_nodes, provided the path given to html_node is relative (that is, it does not start with / or //). IIUC this means html_node treats each returned node as its own root and returns a single node per root (in particular, NA for roots where the path goes unmatched); starting the html_node path with // resets the search so that the root is the overall page root again. I'm not 100% sure of this interpretation, but in practice it means the following works. (NB: I had to download the page as HTML, since the site loads dynamically, at least for me, and can't be read with a plain read_html call.)

URL = '~/Desktop/airlines.html'
#get to table; we end at tbody here instead of tr
#  since we only want one entry for each "table" on the
#  page (i.e., for each review); if we add tr there,
#  the html_nodes call will give us an element for
#  _each row of each table_.
tbl = read_html(URL) %>% 
  html_nodes(xpath = '//table[@class="review-ratings"]/tbody')
#note the %s where we'll substitute the particular element we want
star_xp = paste0('tr/td[@class="%s"]/following-sibling::',
                 'td[@class="review-rating-stars stars"]',
                 '/span[@class="star fill"][last()]') 

tbl %>% 
  html_node(xpath = sprintf(star_xp, "review-rating-header seat_comfort")) %>% 
  html_text
#  [1] NA  "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"

This is pretty ugly, but it follows the flow of extraction I'm accustomed to seeing. I guess the following would be more magrittr-y and easier on the eyes, though a bit nonlinear:

star_xp %>% sprintf("review-rating-header seat_comfort") %>%
  html_node(x = tbl, xpath = .) %>% html_text
#  [1] NA  "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"

For the other one:

star_xp %>% sprintf("review-rating-header inflight_entertainment") %>%
  html_node(x = tbl, xpath = .) %>% html_text
#  [1] NA  NA  "5" "1" "1" "1" "3" "3" "5" NA  "1"
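
Because both calls return one entry per review table, the two vectors have the same length and can be combined directly. The sketch below illustrates this on a small made-up page rather than the real site: the helper functions (star_cell, rating_row, review_table, get_rating) and the toy markup are assumptions for illustration only, loosely modelled on the class names used above. It also contrasts the per-table lookup with a page-level html_nodes call, which simply skips reviews that lack the field.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# --- Toy page (hypothetical markup, not the real site) -------------------
# Three reviews; the second one has no inflight_entertainment row.
star_cell <- function(n) {
  paste0('<td class="review-rating-stars stars">',
         paste0('<span class="star fill">', seq_len(n), '</span>', collapse = ""),
         '</td>')
}
rating_row <- function(field, n) {
  paste0('<tr><td class="review-rating-header ', field, '">', field, '</td>',
         star_cell(n), '</tr>')
}
review_table <- function(rows) {
  paste0('<table class="review-ratings"><tbody>',
         paste(rows, collapse = ""), '</tbody></table>')
}
toy_html <- paste0(
  "<html><body>",
  review_table(c(rating_row("seat_comfort", 4),
                 rating_row("inflight_entertainment", 5))),
  review_table(rating_row("seat_comfort", 5)),
  review_table(c(rating_row("seat_comfort", 3),
                 rating_row("inflight_entertainment", 1))),
  "</body></html>"
)
page <- read_html(toy_html)

# --- One node per review table, then a relative path inside each --------
tbl <- page %>%
  html_nodes(xpath = '//table[@class="review-ratings"]/tbody')

# Relative path (no leading / or //), so it is evaluated per table
star_xp <- paste0('tr/td[@class="review-rating-header %s"]/following-sibling::',
                  'td[@class="review-rating-stars stars"]',
                  '/span[@class="star fill"][last()]')

get_rating <- function(field) {
  tbl %>% html_node(xpath = sprintf(star_xp, field)) %>% html_text()
}

reviews <- data.frame(
  seat_comfort         = get_rating("seat_comfort"),
  entertainment_system = get_rating("inflight_entertainment"),
  stringsAsFactors = FALSE
)
print(reviews)

# Contrast: a page-level search only finds the rows that exist
n_page_level <- page %>%
  html_nodes(xpath = '//td[@class="review-rating-header inflight_entertainment"]') %>%
  length()
```

On this toy page, seat_comfort comes out as "4" "5" "3" and entertainment_system as "5" NA "1", while the page-level search finds only 2 entertainment cells for 3 reviews, which is the length mismatch described in the question.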
