Scrape single node excluding others in same category


Problem description

Building off this question, I'm looking to extract a single kind of node (the "likes" links) from among the nodes with class smallText, ignoring the others. The node I'm looking for is a.smallText, so I need to select only that one.

Code:

library(rvest)
library(stringr)  # str_trim()
library(tibble)   # enframe()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

quote_rating <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(xpath = paste(selectr::css_to_xpath(".smallText"), "/text()"))%>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    enframe(name = NULL)
}

quote_rating(url)

This returns every text node under a .smallText element, not just the likes:

# A tibble: 80 x 1
   value              
   <chr>              
 1 Showing 1-20 of 790
 2 (0.03 seconds)     
 3 tags:              
 4 ""                 
 5 2492 likes         
 6 2265 likes         
 7 tags:              
 8 ,                  
 9 ,                  
10 ,                  
# ... with 70 more rows

Adding an html_nodes("a.smallText") step after that filters out too much:

quote_rating <- function(html){

  path <- read_html(html) 

  path %>% 
    html_nodes(xpath = paste(selectr::css_to_xpath(".smallText"), "/text()")) %>%
    html_nodes("a.smallText") %>% 
    html_text(trim = TRUE) %>%
    str_trim(side = "both") %>% 
    enframe(name = NULL)

}

# A tibble: 0 x 1
# ... with 1 variable: value <chr>
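
The empty result is expected: the first html_nodes() call returns text nodes, and a text node has no child elements, so a second html_nodes("a.smallText") applied to that set has nothing to match. A minimal sketch of the effect follows, assuming a made-up one-line HTML snippet (not the real Goodreads markup) so that it runs without network access:

```r
library(rvest)

# Hypothetical markup for illustration only; the real page is more complex.
html <- '<div class="smallText">tags: <a class="smallText" href="#">12 likes</a></div>'
page <- read_html(html)

# Step 1: select the text nodes directly under any element with class smallText.
text_nodes <- html_nodes(page, xpath = "//*[contains(@class, 'smallText')]/text()")

# Step 2: searching inside text nodes for <a> elements finds nothing,
# because text nodes cannot contain elements.
nested <- html_nodes(text_nodes, "a.smallText")
length(nested)

# Selecting from the document itself finds the link.
direct <- html_nodes(page, "a.smallText")
html_text(direct, trim = TRUE)
```

Here nested is empty while direct holds the one link, which is why the selector has to be applied to the parsed page rather than chained after the text() step.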

Solution

To extract the number of likes for each quote, the filtering can be done with a CSS selector alone: select the a tags that have class="smallText". The XPath text() step is not needed.

This simple code fragment works:

library(rvest)
url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

path <- read_html(url) 

path %>% 
    html_nodes("a.smallText") %>% 
    html_text(trim = TRUE)

# [1] "2492 likes" "2265 likes" "2168 likes" "2003 likes" "1774 likes" "1060 likes" "580 likes" 
# [8] "523 likes"  "482 likes"  "403 likes"  "383 likes"  "372 likes"  "360 likes"  "347 likes" 
# [15] "330 likes"  "329 likes"  "318 likes"  "317 likes"  "310 likes"  "281 likes" 
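
If the counts are needed as numbers rather than strings such as "2492 likes", the digits can be pulled out afterwards. The sketch below is one possible way to do that, not part of the original answer. It runs on a small made-up HTML snippet that only imitates the page layout (the class names other than smallText are assumptions), so it does not depend on the live site:

```r
library(rvest)

# Hypothetical snippet imitating the quote footer; the real markup may differ.
html <- '<div class="quoteFooter">
  <div class="greyText smallText left">tags: <a href="#">freedom</a>, <a href="#">life</a></div>
  <div class="right"><a class="smallText" href="#">2492 likes</a></div>
</div>
<div class="quoteFooter">
  <div class="greyText smallText left">tags: <a href="#">love</a></div>
  <div class="right"><a class="smallText" href="#">580 likes</a></div>
</div>'

# Only <a> elements that themselves carry class smallText are selected;
# the tag links inside the div.smallText are skipped.
likes_text <- read_html(html) %>%
  html_nodes("a.smallText") %>%
  html_text(trim = TRUE)

# Strip everything that is not a digit and convert to integer.
likes <- as.integer(gsub("[^0-9]", "", likes_text))
likes
```

On the real page the same two steps would apply to the path object read from url above, assuming every matched link has the form "<number> likes".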
