Jsoup - extracting text

Question

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

Element.text returns all of the div's content as plain text, with the tags dropped. Element.ownText returns only the text that is not inside child elements. Neither is what I need. Iterating through children skips text nodes.

Is there a way to iterate over the contents of an element so that text nodes are included as well? E.g.


  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>

Recommended answer

Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().

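import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// The class name is only a wrapper so the snippet compiles on its own.
public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() yields text nodes as well as elements, in document order.
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}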

Result:


1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
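
Building on the child-node iteration above, here is a minimal sketch of one way to assemble the lines asked for in the question. The class name ChildNodeLines, the helper names toLines and flush, and the line-breaking rules (a <br> or a block element such as <p> ends the current line, inline elements such as <b> keep their markup) are assumptions made for illustration, not something the original answer specifies.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper: the name and the line-breaking rules are assumptions.
public class ChildNodeLines {

    // Splits the direct children of the first <div> in the HTML into lines.
    public static List<String> toLines(String html) {
        Element div = Jsoup.parse(html).select("div").first();
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // text() collapses runs of whitespace into single spaces.
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("br")) {
                    // A line break ends the line being built.
                    flush(current, lines);
                } else if (el.isBlock()) {
                    // A block element (e.g. <p>) becomes a line of its own.
                    flush(current, lines);
                    lines.add(el.text());
                } else {
                    // Inline elements (e.g. <b>) stay in the line with their tags.
                    current.append(el.outerHtml());
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    // Moves the pending text into the list, skipping whitespace-only lines.
    private static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String html = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";
        for (String line : toLines(html)) {
            System.out.println(line);
        }
    }
}
```

The sketch only looks at the direct children of the div, so nested block elements inside a <p> would be flattened by text(). Other node types, such as comments, are skipped.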
