Jsoup parse and nested tags


Question

I'm learning Jsoup and have this HTML:

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
 </p>
 [...]

I use Jsoup.parse() and document.select("p") to capture "Content", and that works well. But...

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
   <p style="..."></p>
   <p style="..."></p>
 </p>
 [...]

In this case, I see that Jsoup.parse() converts the code to:

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
 </p>
 <p style="..."> <!-- div 4 -->
 </p>
 <p style="..."> <!-- div 5 -->
 </p>
 [...]

How can I keep the nested paragraphs in place with Jsoup (div 4 and 5 inside div 3)?

Added example:

HTML file:

 <html>
 <head>
    <title>Title</title>
 </head>
 <body>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
    </p>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
    </p>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
            <p style="margin-left:2em"></p>
            <p style="margin-left:2em"></p>
    </p>

 </body>
 </html>

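Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// URL_with_HTML is the address that serves the HTML file above
Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());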

Output:

<html>
<head> 
 <title>Title</title> 
</head> 
<body> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
 <p style="margin-left:2em"></p> 
 <p style="margin-left:2em"></p> 
 <p></p>   
</body>
</html>

Is this correct? I'm using Jsoup 1.6.1. I expected Jsoup to return nested paragraphs rather than the output above.

Recommended answer

Nested paragraphs do not exist in HTML. The prior paragraph is closed automatically, because Jsoup implements the WHATWG HTML5 specification:

  1. An open p element is automatically closed by the start tag of any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul. Therefore <p><div></div> becomes <p></p><div></div>.
  2. An end tag whose name is p (i.e. </p>) that has no corresponding open p element is a parse error, and the parser acts as if an empty <p> had been opened just before it. Therefore <span></span></p> becomes <span></span><p></p>.
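
The two rules above can be checked directly with jsoup. The following is a minimal sketch, assuming a reasonably recent jsoup release on the classpath; the class name AutoCloseDemo and its helper methods are made up for illustration. It counts elements instead of printing the serialized HTML, since the exact whitespace of the pretty-printed output can differ between versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AutoCloseDemo {

    // Parse markup with jsoup's HTML parser and count the elements matching a CSS selector.
    static int count(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Rule 1: the <div> start tag closes the open <p>, so the div is not inside the p.
        System.out.println(count("<p>Text<div>Block</div>", "p > div"));

        // Rule 2: a </p> with no open <p> produces an empty <p> element.
        System.out.println(count("<span>Text</span></p>", "p"));

        // The case from the question: the inner <p> closes the outer one,
        // and the final stray </p> adds one more empty paragraph.
        String nested = "<p>One<p>Two</p></p>";
        System.out.println(count(nested, "p"));     // total paragraphs
        System.out.println(count(nested, "p p"));   // paragraphs inside paragraphs
    }
}
```

For the last input the parser ends up with three sibling paragraphs and no paragraph inside another, which matches the extra <p></p> in the output shown in the question.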

So jsoup is correct and your HTML is invalid.

Note that your HTML is invalid because it has too many </p> end tags, not because it "nests" paragraphs. Nesting cannot happen, because an open paragraph is auto-closed as soon as the next <p> starts. The </p> that comes later is then superfluous, because its "corresponding" <p> was already auto-closed.
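
If the grouping really has to survive parsing, two workarounds are possible. Neither is part of the original answer, so treat this as a hedged sketch: either wrap the group in an element that is allowed to contain paragraphs (such as div), or, when the markup is well-formed, parse it with jsoup's XML parser, which does not apply the HTML auto-closing rules. The class name KeepNestingDemo and its method names are hypothetical; Parser.xmlParser() and the three-argument Jsoup.parse are jsoup API that appeared in releases after the 1.6.1 used in the question, so this assumes a newer version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class KeepNestingDemo {

    // Count matches using the default HTML parser (HTML5 auto-closing rules apply).
    static int countWithHtmlParser(String markup, String cssQuery) {
        Document doc = Jsoup.parse(markup);
        return doc.select(cssQuery).size();
    }

    // Count matches using the XML parser, which keeps the tree exactly as written.
    static int countWithXmlParser(String markup, String cssQuery) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String nestedP = "<p>Content<p>A</p><p>B</p></p>";
        String divWrapped = "<div class=\"outer\">Content<p>A</p><p>B</p></div>";

        // HTML parser: the inner paragraphs become siblings of the outer one.
        System.out.println(countWithHtmlParser(nestedP, "p > p"));

        // XML parser: the inner paragraphs stay children of the outer one.
        System.out.println(countWithXmlParser(nestedP, "p > p"));

        // Valid HTML alternative: a div may contain paragraphs, so nesting is kept.
        System.out.println(countWithHtmlParser(divWrapped, "div.outer > p"));
    }
}
```

The div variant is usually the safer choice when you control the markup, because the result is valid HTML that browsers and jsoup interpret the same way. The XML parser only helps when every tag in the input is properly closed.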
