Parsing HTML containing &nbsp; (non-breaking space)
Problem description
I am using rvest to parse a website, and I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace created by the &nbsp; entity in a parsed HTML document?
library("rvest")
library("stringr")
# read_html() replaces html(), which is deprecated in newer rvest releases
minimal <- read_html("<!doctype html><title>blah</title> <p>&nbsp;foo")
bodytext <- minimal %>%
  html_node("body") %>%
  html_text()
Now I have extracted the body text:
bodytext
[1] " foo"
However, I can't remove that pesky bit of whitespace! Neither of these attempts changes the string:
str_trim(bodytext)
gsub(pattern = " ", "", bodytext)
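One way to see why both attempts leave the string unchanged is to look at the code point of the leading character. The sketch below is base R only and assumes that html_text() returned the entity as the character U+00A0; the string is built by hand here rather than parsed, so the names bodytext, code_point and unchanged are purely illustrative.

```r
# Stand-in for the parsed text: a non-breaking space (U+00A0) followed by "foo"
bodytext <- "\u00a0foo"

# Inspect the first character: an ordinary space is code point 32,
# a non-breaking space is code point 160
first_char <- substr(bodytext, 1, 1)
code_point <- utf8ToInt(first_char)
print(code_point)

# A literal ASCII space in the pattern is a different character,
# so nothing is matched and the string keeps all four characters
unchanged <- gsub(" ", "", bodytext, fixed = TRUE)
print(nchar(unchanged))
```

If the code point is 160 rather than 32, the pattern has to target the non-breaking space itself or use a character class that covers it.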
Recommended answer
jdharrison answered:
gsub("\\W", "", bodytext)
That will work, although \W also strips every other non-word character, such as punctuation. You can instead use:
gsub("[[:space:]]", "", bodytext)
which will remove all space characters: tab, newline, vertical tab, form feed, carriage return, space, and possibly other locale-dependent characters. It's a very readable alternative to other, more cryptic regex classes.