Nokogiri generating invalid HTML?

Problem description

I need to process an HTML document and insert some nodes in a few places. The content I'm processing is not valid, but Nokogiri is smart enough to figure out what it should be. The problem is that I don't want to change the original document's formatting, other than the pieces I'm inserting.

Here is an example:

require 'nokogiri'

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

puts Nokogiri::HTML(orig_html).inner_html

# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <meta name="Generator" content="Microsoft Word 97 O.o">
# >> </head>
# >> <body>
# >>         1
# >>         <b></b><p>2</p>
# >>         3
# >>       </body>
# >> </html>

I'd like the output to be the same as the input. The problem is that I can't have <p> inside of <b>. My inclination is to switch to XML, but then there are invalid tags such as the <meta> tag, which is not closed off. HTML is smart enough to recognize this, but XML is not.

Solution

Nokogiri is fixing up the malformed HTML in order to make it parseable. After it has finished, the DOM is in a reasonable state, but the original document is no longer available from Nokogiri.

If you want the original to be untouched, you have to make it valid before passing it to Nokogiri; then you can manipulate it using Nokogiri's methods. Typically I'd do that with a few regular expressions that find the trouble spots and add or adjust tags or their associated closing tags, so Nokogiri can parse the document without needing to fix anything.
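As a rough illustration of that pre-processing step, the sketch below uses a single regex to move the <b> inside the <p> before anything is handed to Nokogiri. The helper name reorder_bold_paragraphs is hypothetical, and the pattern assumes bare <b> and <p> tags with no attributes and no nesting; real Word-generated markup will likely need more patterns than this.

```ruby
# Hypothetical helper (not part of Nokogiri): rewrite <b><p>...</p></b>
# as <p><b>...</b></p> so the HTML parser has nothing to repair there.
# Assumes the tags carry no attributes and are not nested.
def reorder_bold_paragraphs(html)
  html.gsub(%r{<b>\s*<p>(.*?)</p>\s*</b>}mi, '<p><b>\1</b></p>')
end

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

# The result is still a plain string; parse it with Nokogiri afterwards.
fixed_html = reorder_bold_paragraphs(orig_html)
puts fixed_html
```

Everything outside the matched span, including the unclosed <meta> tag and the surrounding whitespace, passes through unchanged.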

It's not a case of HTML being smarter than XML. Nokogiri honors the spirit of the XML specification, which is rigid, and flags an invalid file by populating the errors array. HTML has a less rigid specification, and because browsers are (too) forgiving when parsing and displaying HTML, Nokogiri follows along somewhat: it does fixups and then populates the errors array. (In either case, you can check that array to see what's wrong.)

require 'nokogiri'

orig_html = '
  <html>
  <meta name="Generator" content="Microsoft Word 97 O.o">
  <body>
    1
    <b><p>2</p></b>
    3
  </body>
</html>'

doc = Nokogiri::HTML(orig_html)
doc.errors

doc.errors contains:

[
    [0] #<Nokogiri::XML::SyntaxError: Unexpected end tag : b>
]

Here's how I'd use Nokogiri to fix your sample HTML:

doc = Nokogiri::HTML(orig_html)
p = doc.at('b+p')
p.previous_sibling.remove

This is the HTML at this point:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
    1
    <p>2</p>
    3
  </body>
</html>

Continuing:

p.inner_html = "<b>#{p.content}</b>"
puts doc.to_html

This is the resulting HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 97 O.o">
</head>
<body>
    1
    <p><b>2</b></p>
    3
  </body>
</html>

It's pretty obvious the sample HTML isn't what you're really working with, so you'll have to change the accessors to locate the tags that need to be changed, but that should get you going.
