How to iterate through the text and attributes of an HTML document in their correct order using JSoup
Question
How do you iterate through the texts and attributes of an HTML fragment in their correct order using JSoup?
<a href="link1"> text child 1</a>
own text 1
<b> text child 2</b>
own text 2
I want to do some processing for each attribute and text. For example, the final output might look like the following:
1) text child 1 (is a link)
2) own text 1
3) text child 2 (is bold)
4) own text 2
Currently, I can iterate over the child elements:
Elements elements = element.children(); // gives me child 1 and 2
for (Element e : elements) {
//... do processing plus extract childText...
}
or get the own text, but I don't know how to do both together.
String text = element.ownText(); // gives me own text 1 and 2;
Also, I do not want to use the following, because the line information is lost:
String text = element.text();
How can I iterate through the element so that I get
child 1 -> text 1 -> child 2 -> text 2 (where text 1 and 2 are separated)
Recommended answer
If your HTML is not very complex, you could use:
for (Node node : document.body().childNodes()) {
if (node instanceof TextNode) {
System.out.println(((TextNode) node).text());
} else if (node instanceof Element) {
System.out.println(((Element) node).ownText());
}
}
If it is more complex, you could walk the node tree recursively:
public static void main(String[] args) {
try {
Document document = Jsoup
.parse("<a href=\"link1\"> text child 1</a>\r\n" + "own text 1\r\n" + "<b> text child 2</b>\r\n" + "own text 2");
handleElement(document.body());
} catch (Exception e) {
e.printStackTrace();
}
}
public static void handleElement(Node parent) {
if (parent instanceof TextNode) {
System.out.println(((TextNode) parent).text());
}
for (Node node : parent.childNodes()) {
handleElement(node);
}
}
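The same depth-first walk can probably also be written with jsoup's built-in visitor instead of hand-written recursion. The following is only a minimal sketch under that assumption: the class name OrderedTextWalker and the method collectTexts are hypothetical names chosen for illustration, and the "tag: text" output format is not part of the original answer. It collects every non-blank text node in document order, together with the name of the element that directly contains it.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class OrderedTextWalker {

    // Hypothetical helper: collects every non-blank text node under root,
    // in document order, prefixed with the name of its direct parent node.
    public static List<String> collectTexts(Node root) {
        final List<String> result = new ArrayList<>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // head() is called when a node is first reached (pre-order).
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    if (!textNode.isBlank()) {
                        result.add(textNode.parent().nodeName() + ": " + textNode.text().trim());
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return result;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse("<a href=\"link1\"> text child 1</a>\r\n"
                + "own text 1\r\n" + "<b> text child 2</b>\r\n" + "own text 2");
        for (String line : collectTexts(document.body())) {
            System.out.println(line);
        }
    }
}
```

Because each text node is visited exactly once, texts nested at any depth keep their position relative to their siblings, which is the ordering the question asks for.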
The following code prints out what you described:
int counter = 1;
for (Node node : document.body().childNodes()) {
if (node instanceof TextNode) {
System.out.println(counter++ + ") " + ((TextNode) node).text().trim());
} else if (node instanceof Element) {
Element element = (Element) node;
String suffix = "";
if ("a".equals(element.tagName())) {
suffix = " (is a link)";
} else if ("b".equals(element.tagName())) {
suffix = " (is bold)";
}
System.out.println(counter++ + ") " + element.ownText() + suffix);
}
}
1) text child 1 (is a link)
2) own text 1
3) text child 2 (is bold)
4) own text 2
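The question also mentions attributes, which the output above does not show. As a hedged sketch of one way to include them, the loop over childNodes() can read each element's attributes alongside its text. The class name OrderedTextAndAttributes, the method describe, and the "[key=value]" format are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class OrderedTextAndAttributes {

    // Hypothetical helper: returns one line per direct child node of parent,
    // in document order. Element lines carry their attributes as [key=value].
    public static List<String> describe(Element parent) {
        List<String> lines = new ArrayList<>();
        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    lines.add(text);
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                StringBuilder line = new StringBuilder(element.text().trim());
                // Attributes is iterable, so each attribute can be read in turn.
                for (Attribute attribute : element.attributes()) {
                    line.append(" [").append(attribute.getKey())
                        .append("=").append(attribute.getValue()).append("]");
                }
                lines.add(line.toString());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse("<a href=\"link1\"> text child 1</a>\r\n"
                + "own text 1\r\n" + "<b> text child 2</b>\r\n" + "own text 2");
        for (String line : describe(document.body())) {
            System.out.println(line);
        }
    }
}
```

If only a single known attribute is needed, element.attr("href") would avoid the inner loop.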