Strip text from HTML document using Ruby


Question

There are lots of examples of how to strip HTML tags from a document using Ruby. Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you quickly and easily.

What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.

I considered looping through the document and setting inner_html to nil on each element. That would have to be done in reverse, though, because the inner_html of the first element (the root) is the entire rest of the document. Ideally I'd start at the innermost elements and set inner_html to nil while moving up through the ancestors.

Does anyone know a neat little trick for doing this efficiently? I was thinking a regex might do it, but probably not as efficiently as an HTML tokenizer/parser would.

Solution

This works too:

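require 'nokogiri'
doc = Nokogiri::HTML(your_html)
# Select every text node in the document and detach it; tags and attributes stay.
doc.xpath("//text()").remove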
