Jsoup parse and nested tags
Problem description
I'm learning Jsoup and have this HTML:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
[...]
I use Jsoup.parse() and document.select("p") to capture the content, and that works well. But with this HTML:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
<p style="..."></p>
<p style="..."></p>
</p>
[...]
In this scenario, I see that Jsoup.parse() converts the code to:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
<p style="..."> <!-- div 4 -->
</p>
<p style="..."> <!-- div 5 -->
</p>
[...]
How can I keep the order of nested paragraphs with Jsoup (div 4 and 5 inside div 3)?
Added example:
HTML file:
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
</p>
</body>
</html>
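Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// URL_with_HTML is a placeholder for the address that serves the HTML file above
Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());
Returns: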
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
<p></p>
</body>
</html>
Is this correct? I am using Jsoup 1.6.1. I expected Jsoup to return nested paragraphs instead of the output above.
Recommended answer
Nested paragraphs do not exist in HTML. The prior paragraph is closed automatically, since Jsoup implements the WHATWG HTML5 specification:
- A p tag is automatically closed by the start tag of any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul. Therefore <p><div></div> becomes <p></p><div></div>.
- An end tag whose name is p (i.e. </p>) that does not have a corresponding open start tag is a parse error, and the parser acts as if a <p> start tag had come right before it. Therefore <span></span></p> becomes <span></span><p></p>.
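The two rules above can be sketched in plain Java. This is only a simplified illustration, not how Jsoup or the specification actually implements tree construction: it tracks a single "is a p open?" flag, ignores scoping and every other element, and the class and method names (PAutoCloseSketch, normalize) are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PAutoCloseSketch {
    // Start tags that implicitly close an open <p> (the list quoted in the answer).
    private static final Set<String> CLOSES_P = new HashSet<>(Arrays.asList(
        "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
        "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header",
        "hgroup", "hr", "main", "menu", "nav", "ol", "p", "pre", "section",
        "table", "ul"));

    // Matches a start or end tag: group 1 is "/" for end tags, group 2 is the tag name.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Hypothetical helper: rewrites the markup the way the two rules describe.
    static String normalize(String html) {
        StringBuilder out = new StringBuilder();
        boolean pOpen = false;
        int last = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.append(html, last, m.start()); // copy the text between tags unchanged
            last = m.end();
            boolean isEnd = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();

            // Rule 1: a listed start tag closes the paragraph that is still open.
            if (!isEnd && pOpen && CLOSES_P.contains(name)) {
                out.append("</p>");
                pOpen = false;
            }
            // Rule 2: a stray </p> with no open paragraph yields an empty paragraph.
            if (isEnd && name.equals("p") && !pOpen) {
                out.append("<p></p>");
                continue;
            }
            out.append(m.group());
            if (name.equals("p")) {
                pOpen = !isEnd;
            }
        }
        out.append(html.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("<p><div></div>"));
        System.out.println(normalize("<span></span></p>"));
        System.out.println(normalize("<p>Content<p></p><p></p></p>"));
    }
}
```

Fed the third paragraph from the question, the sketch produces three closed paragraphs followed by one extra empty paragraph, which is the same shape as the Jsoup output shown in the question.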
So jsoup is correct and your HTML is invalid.
Be sure to understand that your HTML is invalid because you have too many </p> end tags, not because of "nested" paragraphs. Nesting cannot happen because paragraphs get auto-closed. The later </p> is superfluous because the "corresponding" <p> was already auto-closed before it.
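To check this behaviour yourself, a small sketch along these lines may help. It assumes the jsoup library is on the classpath; the class and method names (NestedParagraphCheck, paragraphCount, hasNestedParagraph) are invented for illustration, and the input string is a shortened version of the third paragraph from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NestedParagraphCheck {
    // Counts every <p> element jsoup builds from the given markup.
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    // True if any <p> ends up as a descendant of another <p> after parsing.
    static boolean hasNestedParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return !doc.select("p p").isEmpty();
    }

    public static void main(String[] args) {
        String html = "<p>Content<p></p><p></p></p>";
        System.out.println("paragraphs: " + paragraphCount(html));
        System.out.println("nested: " + hasNestedParagraph(html));
        // Print the parsed body to see the flattened sibling paragraphs.
        System.out.println(Jsoup.parse(html).body().html());
    }
}
```

The descendant selector "p p" is a quick way to confirm that no paragraph is nested in the parsed tree, whatever the source markup looked like.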