Jsoup - extracting text
Question
I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
I need to produce:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text() returns all of the div's content as one string. Element.ownText() returns only the text that is not inside child elements. Neither gives what I need. Iterating through children() ignores text nodes.
Is there a way to iterate over the contents of an element so that text nodes are returned as well? E.g.
- Text node - Some text
- Node <b> - with tags
- Text node - might go here.
- Node <p> - Also there are paragraphs
- Text node - More text can go without paragraphs
- Node <br> - <empty>
Recommended answer
Element.children() returns an Elements object, which is a list of Element objects. The parent class, Node, has methods that give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
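import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// The class name is arbitrary; it only makes the snippet compile on its own.
public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                " Some text <b>with tags</b> might go here." +
                " <p>Also there are paragraphs</p>" +
                " More text can go without paragraphs<br/>" +
                "</div>";
        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;
        // childNodes() yields text nodes as well as elements, in document order.
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}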
Result:
1 TextNode
Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
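Building on the childNodes() loop above, the output asked for in the question could be assembled roughly as follows. This is only a sketch: the class DivTextExtractor and its extract method are hypothetical names, and the rule it applies (keep inline markup such as <b> verbatim, turn <p> and <br> into line breaks) is an assumption read off the desired output, not something the answer specifies. Pretty-printing is switched off so that outerHtml() returns the inline tag without added whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
class DivTextExtractor {

    static String extract(String html) {
        Document doc = Jsoup.parse(html);
        // Disable pretty-printing so outerHtml() adds no indentation or newlines.
        doc.outputSettings().prettyPrint(false);
        Element div = doc.select("div").first();

        StringBuilder sb = new StringBuilder();
        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // text() returns the node's text with whitespace normalised.
                sb.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("p")) {
                    // Assumption: a paragraph goes on a line of its own.
                    sb.append('\n').append(el.text()).append('\n');
                } else if (el.tagName().equals("br")) {
                    // Assumption: <br> ends the current line.
                    sb.append('\n');
                } else {
                    // Assumption: any other element keeps its markup.
                    sb.append(el.outerHtml());
                }
            }
        }

        // Trim each line and drop the empty ones.
        List<String> lines = new ArrayList<>();
        for (String line : sb.toString().split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String html = "<div>" +
                " Some text <b>with tags</b> might go here." +
                " <p>Also there are paragraphs</p>" +
                " More text can go without paragraphs<br/>" +
                "</div>";
        System.out.println(extract(html));
    }
}
```

The sketch only looks at direct children of the div. Nested block elements would need a recursive walk, which is left out here.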