Excluding nodes with rvest

Problem description

I am scraping blog text using rvest and am struggling to find a simple way to exclude specific nodes. The following pulls the text:

library(rvest)

AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")

testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_text() %>%
  as.character()

I want to exclude the two nodes with the IDs "contenttitle" and "commentblock". Below, I try excluding just the comments using the ID "commentblock".

testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_node(":not(#commentblock)") %>%
  html_text() %>%
  as.character()

When I run this, the result is simply the date; all the rest of the text is gone. Any suggestions?

I have spent a lot of time searching for an answer, but I am new to R (and HTML), so I appreciate your patience if this is something obvious.

Recommended answer

You were almost there. You should use html_nodes instead of html_node.

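html_node retrieves only the first matching element it encounters, while html_nodes returns every matching element in the page as a node set. That is why the attempt above returned just the date: the date is the first element inside #contentmiddle that is not #commentblock.
After html_text has turned the node set into a character vector, the toString() function collapses that vector into a single comma-separated string.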
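The difference can be seen on a small inline document. This is only a sketch: the markup below is a hypothetical stand-in for the blog page (only the IDs contentmiddle, contenttitle and commentblock come from the question; contentdate and the paragraph text are made up), so the real page may behave differently in detail. In rvest 1.0.0 and later the same functions are also available under the names html_element and html_elements.

```r
library(rvest)

# Hypothetical markup; the structure of the real blog page is an assumption.
page <- read_html('
<div id="contentmiddle">
  <div id="contentdate">Mar 3</div>
  <div id="contenttitle">A title</div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <div id="commentblock">A comment</div>
</div>')

# html_node() stops at the first element inside #contentmiddle that is
# not #commentblock, which here is the date block.
first_only <- page %>%
  html_node("#contentmiddle") %>%
  html_node(":not(#commentblock)") %>%
  html_text()

# html_nodes() with the child combinator (>) returns every direct child
# of #contentmiddle except #commentblock.
all_children <- page %>%
  html_nodes("#contentmiddle > :not(#commentblock)") %>%
  html_text()

# toString() joins the character vector with ", ".
combined <- toString(all_children)

first_only
all_children
combined
```

The child combinator matters here: without it, :not(#commentblock) would also match elements nested inside the comment block, since those elements do not carry that ID themselves.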

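library(rvest)

AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")

# Select every direct child of #contentmiddle except #commentblock,
# extract the text of each, and collapse the pieces into one string.
testpost <- AllandSundry_test %>% 
  html_nodes("#contentmiddle>:not(#commentblock)") %>% 
  html_text %>%
  as.character %>%
  toString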

testpost
#> [1] "\n\t\tMar\n\t\t3\n\t, Mar, 3, \n\t\tLet's go back to 
#> commenting on the weather\n\t\t\n\t\t, Let's go back to commenting on 
#> the weather, Let's go back to commenting on the weather, I have just 
#> returned from the grocery store, and I need to get something off my chest. 
#> When did "Got any big plans for the rest of the day?" become 
#> the default small ...<truncated>

You will still need to clean up the string a bit.
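One possible way to do that cleanup, and to drop the "contenttitle" node as well, is sketched below. It is an illustration on hypothetical inline markup rather than a tested recipe for the actual page: the variable names (raw_text, clean_text, post) and the sample HTML are made up, and joining paragraphs with a newline instead of toString() is just one design choice.

```r
library(rvest)

# Hypothetical markup standing in for the blog post; the real page may differ.
page <- read_html('
<div id="contentmiddle">
  <div id="contenttitle">
    A title
  </div>
  <p>First   paragraph.</p>
  <p>
    Second paragraph.
  </p>
  <div id="commentblock">A comment</div>
</div>')

# Chain two :not() filters so that both unwanted children are skipped.
raw_text <- page %>%
  html_nodes("#contentmiddle > :not(#contenttitle):not(#commentblock)") %>%
  html_text()

# Collapse runs of whitespace (tabs, newlines, repeated spaces) into a
# single space, trim both ends, and drop any strings left empty.
clean_text <- trimws(gsub("\\s+", " ", raw_text))
clean_text <- clean_text[nzchar(clean_text)]

# Join the pieces with a newline rather than the ", " used by toString().
post <- paste(clean_text, collapse = "\n")
post
```

Cleaning each element before joining keeps paragraph boundaries intact, which is harder to recover once everything has been collapsed into one comma-separated string.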
