How to iterate through the text and attributes of an HTML fragment in their correct order using JSoup

Question

How do you iterate through the text and attributes of an HTML fragment in their correct order using JSoup?

<a href="link1"> text child 1</a>
own text 1
<b> text child 2</b>
own text 2

I want to do some processing for each attribute and each piece of text. For example, the final output might look like this:

1) text child 1 (is a link)
2) own text 1 
3) text child 2 (is bold)
4) own text 2

Currently, I can iterate over the child elements:

Elements elements = element.children(); // gives me child 1 and 2
for (Element e : elements) {
    //... do processing plus extract childText...
}

Or I can get the own text, but I don't know how to do both together:

String text = element.ownText(); // gives me own text 1 and 2;

Also, I do not want to use the following, because the row information is lost:

String text = element.text();

How can I iterate through the element so that I get:

child 1 -> text 1 -> child 2 -> text 2 (where text 1 and 2 are separated)

Recommended answer

If your HTML is not very complex, you could iterate over the child nodes of the body, which include both elements and text nodes in document order:

for (Node node : document.body().childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text());
    } else if (node instanceof Element) {
        System.out.println(((Element) node).ownText());
    }
}

If it is more complex, you could walk the node tree recursively:

public static void main(String[] args) {
    try {
        Document document = Jsoup
                .parse("<a href=\"link1\"> text child 1</a>\r\n" + "own text 1\r\n" + "<b> text child 2</b>\r\n" + "own text 2");

        handleElement(document.body());
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void handleElement(Node parent) {
    if (parent instanceof TextNode) {
        System.out.println(((TextNode) parent).text());
    }
    for (Node node : parent.childNodes()) {
        handleElement(node);
    }
}
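
As an alternative to writing the recursion by hand, the same depth-first walk can be sketched with jsoup's NodeVisitor, passed to Node.traverse. This is only a minimal sketch: the class name TextInDocumentOrder and the helper collectTexts are made up for illustration, and skipping blank text nodes and trimming the rest are assumptions about what you want to keep.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

public class TextInDocumentOrder {

    // Hypothetical helper: collects the trimmed text of every non-blank
    // TextNode below root, in document order.
    static List<String> collectTexts(Node root) {
        final List<String> texts = new ArrayList<>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // head() is called when a node is first reached (pre-order).
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    if (!textNode.isBlank()) {
                        texts.add(textNode.text().trim());
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return texts;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<a href=\"link1\"> text child 1</a>\n"
                        + "own text 1\n"
                        + "<b> text child 2</b>\n"
                        + "own text 2");
        for (String text : collectTexts(document.body())) {
            System.out.println(text);
        }
    }
}
```

Because the visitor reaches text nodes at any depth, text inside nested elements is collected as well, still in the order it appears in the document.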

The following code prints the output you described:

int counter = 1;
for (Node node : document.body().childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(counter++ + ") " + ((TextNode) node).text().trim());
    } else if (node instanceof Element) {
        Element element = (Element) node;
        String suffix = "";
        if ("a".equals(element.tagName())) {
            suffix = " (is a link)";
        } else if ("b".equals(element.tagName())) {
            suffix = " (is bold)";
        }
        System.out.println(counter++ + ") " + element.ownText() + suffix);
    }
}

1) text child 1 (is a link)
2) own text 1
3) text child 2 (is bold)
4) own text 2
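
Since the question also asks about attributes, here is a hedged variation that combines the recursive walk with a lookup on each text node's parent element, so the href value can be reported next to the link text. The class name AnnotatedTexts, the helpers describe, collect and suffixFor, and the exact wording of the suffixes are assumptions made for this sketch. It only inspects the direct parent, so text wrapped in a further element inside a link would not be labelled as a link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class AnnotatedTexts {

    // Hypothetical helper: returns one numbered line per non-blank text node,
    // in document order, annotated according to the text node's parent tag.
    static List<String> describe(Node root) {
        List<String> lines = new ArrayList<>();
        collect(root, lines);
        return lines;
    }

    private static void collect(Node node, List<String> lines) {
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            if (!textNode.isBlank()) {
                lines.add((lines.size() + 1) + ") "
                        + textNode.text().trim()
                        + suffixFor(textNode.parent()));
            }
        }
        // Visit children in order; text nodes have no children.
        for (Node child : node.childNodes()) {
            collect(child, lines);
        }
    }

    // Only the direct parent is checked (an assumption of this sketch).
    private static String suffixFor(Node parent) {
        if (parent instanceof Element) {
            Element element = (Element) parent;
            if ("a".equals(element.tagName())) {
                return " (is a link to " + element.attr("href") + ")";
            }
            if ("b".equals(element.tagName())) {
                return " (is bold)";
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Node body = Jsoup.parse(
                "<a href=\"link1\"> text child 1</a>\n"
                        + "own text 1\n"
                        + "<b> text child 2</b>\n"
                        + "own text 2").body();
        for (String line : describe(body)) {
            System.out.println(line);
        }
    }
}
```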
