Repairing invalid HTML with Nokogiri (removing invalid tags)


Question


I'm trying to tidy some retrieved HTML using the tidy-ext gem. However, it fails when the HTML is quite broken, so I'm trying to repair the HTML using Nokogiri first:

repaired_html = Nokogiri::HTML.parse(a.raw_html).to_html

It seems to do a nice job, but I recently encountered a sample where people had inserted FBML markup such as <fb:like> into the HTML document. Nokogiri preserves that tag even though it is not valid HTML, and Tidy then reports "Error: <fb:like> is not recognized!", which is understandable.

Is there an option, such as strict, that forces Nokogiri to keep only valid HTML tags and omit everything else?

Solution

You can parse the HTML with Nokogiri's XML parser, which is stricter than the HTML parser, but that only helps a little: by default it still applies fixups so that the resulting HTML/XML is at least marginally correct. By adjusting the flags you pass to the parser, you can make Nokogiri more rigid so that it refuses to return an invalid document. Nokogiri is not a sanitizer or a whitelist for desired tags; check out Loofah and Sanitize for that functionality.

If your HTML content is in a variable called html, and you do:

doc = Nokogiri::XML.parse(html)

then check doc.errors afterwards to see if you had errors. Nokogiri will attempt to fix them, but anything that generated an error will be flagged there.

For instance:

Nokogiri::XML('<fb:like></fb:like>').errors
=> [#<Nokogiri::XML::SyntaxError: Namespace prefix fb on like is not defined>]

Nokogiri will attempt to fix up the HTML:

Nokogiri::XML('<fb:like></fb:like>').to_xml
=> "<?xml version=\"1.0\"?>\n<like/>\n"

but it only corrects it to the point of removing the unknown namespace on the tag.

If you want to strip those nodes:

doc = Nokogiri::XML('<fb:like></fb:like>')
doc.search('like').each { |n| n.remove }
doc.to_xml
=> "<?xml version=\"1.0\"?>\n"
