Parsing XML with Jsoup
Question
I get the following XML, which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal, but for now I have to accept it.
The article should be displayed like this:
- Some text blalalala
- Small subtitle
- List with items
- Even more freakin text
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText(), but then I have no idea where the other stuff (such as the subtitle) is placed; I only get one big String.
Would it be better to use an event-based parser for this (I hate them :() or is there a way to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it is interrupted by an element.
I learned that I can get all the text within <content> with .textNodes(), which works great, but then again I don't know where each text node belongs in my article (one at the top before the h2, another one at the bottom).
Recommended answer
The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
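The node-by-node walk described above might look roughly like the following. This is only a sketch, not the answerer's actual code: the class name ArticleParts, the method split, the "text: " / tag-name prefixes, and the choice to skip <br> are all assumptions made for illustration. It relies on Jsoup's childNodes(), which returns text nodes and elements together in document order.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class ArticleParts {

    /**
     * Hypothetical helper: returns the direct children of <content>
     * as labelled strings, in the order they appear in the document.
     */
    static List<String> split(String xml) {
        // Use the XML parser so Jsoup does not wrap the input in <html><body>.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts;
        }
        // childNodes() includes TextNodes, unlike children(), which only has Elements.
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {          // ignore whitespace between tags
                    parts.add("text: " + text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("br")) {
                    continue;                   // assumption: line breaks carry no content
                }
                parts.add(el.tagName() + ": " + el.text());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\nSome text blalalala\n<h2>Small subtitle</h2>\n"
                + "Some more text blbla\n<ul class=\"list\">\n<li>List item 1</li>\n"
                + "<li>List item 2</li>\n</ul>\n<br />\nEven more freakin text\n</content>";
        for (String part : split(xml)) {
            System.out.println(part);
        }
    }
}
```

Because the loop visits text and elements in one pass, each piece of text keeps its position relative to the h2 and the list, which is what ownText() and textNodes() alone do not give you.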