Scrape first class node but not child using rvest

Question

There are many questions on this topic, but I couldn't find the answer I'm looking for.

I want to extract specific text from elements with the class .quoteText. My code works, but it also extracts the text of all the child nodes within .quoteText:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_trim()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(".quoteText") %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

quote_text(url)

The result contains the quote text, but also the text of every child node!

This is what the inspector tool brings up. What I'm looking for is the highlighted line, but not the sub-lines nested under the same node.

There must be a way to scrape only that line, no? Or will I need to collect the whole node and then remove the rest with str_extract / a regex?

Recommended answer

It doesn't look like CSS selectors support getting just the immediate text of the selected node, but XPath does. We can adjust your function to extract only that text:

quote_text <- function(html){

  path <- read_html(html)

  # Translate the CSS selector to XPath, then select only the direct text-node children
  xp <- paste0(selectr::css_to_xpath(".quoteText"), "/text()")

  path %>% 
    html_nodes(xpath = xp) %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

I convert the CSS selector to an XPath one and then append "/text()" so that only the text nodes that are direct children of the matched elements are returned.
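To see the difference without hitting the live site, here is a minimal sketch on an inline HTML snippet. The markup, the authorOrTitle class and the placeholder strings are made up for illustration and are not taken from the actual page; only the rvest and selectr calls are the ones used above.

```r
library(rvest)

# Hypothetical markup: a .quoteText element with its own text plus a child element.
html <- '<div class="quoteText">
  Quote text goes here.
  <span class="authorOrTitle">Author Name</span>
</div>'

page <- read_html(html)

# Plain CSS selection: html_text() concatenates the text of all descendants.
all_text <- page %>%
  html_nodes(".quoteText") %>%
  html_text(trim = TRUE)

# CSS selector translated to XPath, then "/text()" steps to the direct text-node children.
xp <- paste0(selectr::css_to_xpath(".quoteText"), "/text()")
own_text <- page %>%
  html_nodes(xpath = xp) %>%
  html_text(trim = TRUE)

# Whitespace-only text nodes around child elements come back as "", so drop them.
own_text <- own_text[nzchar(own_text)]

print(all_text)
print(own_text)
```

Here all_text still includes the child's "Author Name", while own_text keeps only the element's own text. Dropping the empty strings is an extra step not in the function above; on a real page you may or may not want it, depending on how the markup is laid out.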
