How to prevent Nokogiri from adding <DOCTYPE> tags?
Problem description
I noticed something strange using Nokogiri recently. All of the HTML I had been parsing had been given opening and closing <html> and <body> tags.
<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n
How can I prevent Nokogiri from doing this?
That is, when I do:
doc = Nokogiri::HTML("<div>some content</div>")
doc.to_s
or:
doc.to_html
I get the original:
<html blah><body><div>some content</div></body></html>
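The problem occurs because you're using the wrong Nokogiri method to parse your content. Nokogiri::HTML always builds a complete document, so it adds the DOCTYPE and the <html> and <body> wrappers: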
require 'nokogiri'
doc = Nokogiri::HTML('<p>foobar</p>')
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>foobar</p></body></html>
Rather than using HTML, which results in a complete document, use HTML.fragment, which tells Nokogiri you only want the fragment parsed:
doc = Nokogiri::HTML.fragment('<p>foobar</p>')
puts doc.to_html
# >> <p>foobar</p>