Strip text from HTML document using Ruby

Views: 114  Published: 2018/6/15 11:02:55  Tags: html ruby nokogiri hpricot


Problem description


There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.

What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.

I considered looping through the document and setting inner_html to nil, but you'd really have to do this in reverse: the first element (root) has an inner_html containing the entire rest of the document, so ideally I'd start at the innermost element and set inner_html to nil whilst moving up through the ancestors.

Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.
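For comparison, the regex idea could look roughly like the sketch below. It uses only core Ruby, and the method name `strip_text_with_regex` and the sample markup are made up for illustration. It assumes simple, well-formed markup with no `<` or `>` inside attribute values, comments, or script/style bodies, which is exactly where a regex tends to break and a parser does not.

```ruby
# Naive sketch: drop whatever sits between a '>' and the next '<'.
# Assumption: no '<' or '>' appears inside attribute values, comments,
# <script> or <style> bodies. Text before the first tag or after the
# last tag is left untouched.
def strip_text_with_regex(html)
  html.gsub(/>[^<]+</, '><')
end

sample = '<div id="main"><p>Hello <b>world</b>!</p></div>'
puts strip_text_with_regex(sample)
# prints: <div id="main"><p><b></b></p></div>
```

This is fine for a quick one-off on markup you control, but it should not be relied on for arbitrary HTML.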
Solution
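This works too:

    require 'nokogiri'
    doc = Nokogiri::HTML(your_html)   # parse the HTML string into a document
    doc.xpath("//text()").remove      # detach every text node; tags and attributes stay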
