Parsing HTML containing &nbsp; (non-breaking space)

Views: 93 | Published: 2021/8/31 18:45:26 | Tags: r, stringr

Problem description

I am using rvest to parse a website, and I'm hitting a wall with non-breaking spaces. How does one remove the whitespace that the &nbsp; entity creates in a parsed HTML document?

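library("rvest")
library("stringr")

# read_html() replaces html(), which is deprecated in current rvest releases
minimal <- read_html("<!doctype html><title>blah</title> <p>&nbsp;foo")

# Extract the text content of the <body> node
bodytext <- minimal %>%
  html_node("body") %>%
  html_text()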

Now I have extracted the body text:

bodytext
[1] " foo"

However, I can't remove that pesky bit of whitespace. Neither of these attempts works:

str_trim(bodytext)

gsub(pattern = " ", "", bodytext)

Recommended answer

jdharrison answered:

gsub("\\W", "", bodytext)

That will work, but \\W strips every non-word character (punctuation included), not just whitespace. You can instead use:

gsub("[[:space:]]", "", bodytext)

This removes all whitespace characters: tab, newline, vertical tab, form feed, carriage return, space, and possibly other locale-dependent characters. It is a very readable alternative to other, more cryptic regex classes.
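
For readers who want to see why the ordinary-space pattern in the question fails, here is a minimal base-R sketch. It is not part of the original answer. It assumes the character that &nbsp; decodes to is U+00A0 (the non-breaking space), and it builds the string by hand rather than through rvest so that it runs without any packages; the variable names `unchanged` and `cleaned` are illustrative only.

```r
# Assumption: &nbsp; decodes to U+00A0, so this string stands in for the
# bodytext value extracted in the question.
bodytext <- "\u00a0foo"

# Inspect the code points: the first one is 160 (U+00A0), not 32 (ordinary space).
utf8ToInt(bodytext)
# [1] 160 102 111 111

# A literal ordinary space does not match U+00A0, so nothing is removed.
unchanged <- gsub(" ", "", bodytext, fixed = TRUE)

# Targeting U+00A0 directly removes it, independent of locale.
cleaned <- gsub("\u00a0", "", bodytext, fixed = TRUE)
cleaned
# [1] "foo"
```

Matching the code point explicitly is narrower than [[:space:]]: it leaves tabs, newlines, and ordinary spaces in place, which may or may not be what you want.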
