Jsoup - How to extract every element
Question
I'm trying to get font information using Jsoup. For example, here is my code:
result = rtfToHtml(new StringReader(streamToString((InputStream) contents.getTransferData(dfRTF))));
// Example of text extraction from HTML
// String test = result.toString();
// Parse the HTML
Document doc = Jsoup.parse(result);
// Split the serialized document on "font-family"
String strdoc = doc.toString();
String[] words = strdoc.split("font-family");
// Select the first bold element
Element firstBoldElt = doc.select("b").first();
// Select the body and serialize it
Elements ele = doc.select("body");
String test = ele.toString();
// Select all bold elements and get their combined text
Elements all = doc.select("b");
String boldtext = all.text();
With this code, my output looks like this:
"<body>
<p class="default">
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<b>Hello World</b>
</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">, Testing</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<i><b>Font </b></i>
</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;"> Style</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<i>Check</i>
</span>
<span style="color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;"></span>
</p>
</body>"
I can extract the first bold element or all bold elements, but how can I extract every element, like this?
<b>Hello World</b>
, Testing
<i><b>Font </b></i>
Style
<i>Check</i>
Any advice or references are highly appreciated.
EDITED
<body lang="en-MY" dir="LTR">
<p style="margin-bottom: 0in">
<font color="#000000"> <font face="ArialMT, serif"> <font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<b>BOLD </b>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
REGULAR
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<u>
<span style="font-weight: normal">
UNDERLINED
</span>
</u>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<i>
<span style="text-decoration: none">
<span style="font-weight: normal">
ITALIC
</span>
</span>
</i>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<i>
<span style="text-decoration: none">
<b>BOLDITALIC</b>
</span>
</i></font>
</font></font></p>
</body>
Recommended answer
If you only need to extract the text from a document, plus any <b> or <i> tags (as per your example), consider using the Whitelist class (see docs):
String html = "<body><p class='default'> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <b>Hello World</b> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> , Testing </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i><b>Font </b></i> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> Style </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i>Check</i> </span> <span style='color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;'> </span> </p></body>";
Whitelist wl = Whitelist.simpleText();
wl.addTags("b", "i"); // add additional tags here as necessary
String clean = Jsoup.clean(html, wl);
System.out.println(clean);
This will output (as per your example):
11-07 19:04:45.738: I/System.out(318): <b>Hello World</b> , Testing
11-07 19:04:45.738: I/System.out(318): <i><b>Font </b></i> Style
11-07 19:04:45.738: I/System.out(318): <i>Check</i>
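Two notes on the snippet above, with a hedged sketch. Whitelist.simpleText() already permits b, em, i, strong and u, so the addTags call is only needed for tags outside that set. Also, more recent jsoup releases rename Whitelist to Safelist; check which one your version ships. The sketch below assumes a Safelist-era jsoup and starts from Safelist.none() so that only <b> and <i> survive. The class and method names (KeepBoldItalic, keepBoldItalic) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class KeepBoldItalic {
    // Hypothetical helper: strips all markup except <b> and <i>.
    // Safelist.none() allows text nodes only; addTags re-admits the two tags.
    static String keepBoldItalic(String html) {
        Safelist allowed = Safelist.none().addTags("b", "i");
        // Turn off pretty-printing so jsoup does not insert its own line breaks.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", allowed, settings);
    }

    public static void main(String[] args) {
        String html = "<p class='default'>"
                + "<span style='font-size: 21pt'><b>Hello World</b></span>"
                + "<span style='font-size: 21pt'>, Testing</span>"
                + "<span style='font-size: 21pt'><i><b>Font </b></i></span>"
                + "</p>";
        // The <p> and <span> wrappers are dropped; their text and <b>/<i> children remain.
        System.out.println(keepBoldItalic(html));
    }
}
```

For the markup in the EDITED part of the question, which also uses <u>, add "u" to the addTags call if underlining should be preserved.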
Update:
// Collect the inner HTML of every <span> in the document
List<String> elements = new ArrayList<>();
for (Element span : doc.select("span")) {
    elements.add(span.html());
}
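The loop above relies on a doc variable from the question. As a self-contained sketch of the same idea, the following parses a shortened version of the question's markup and collects each span's inner HTML, skipping spans that are empty after trimming. The class and method names (SpanFragments, spanFragments) are hypothetical, and disabling pretty-printing is an added choice to keep each fragment on one line.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SpanFragments {
    // Hypothetical helper: inner HTML of every non-empty <span>, in document order.
    static List<String> spanFragments(String html) {
        Document doc = Jsoup.parse(html);
        // Keep the serialized fragments free of jsoup's added indentation.
        doc.outputSettings().prettyPrint(false);
        List<String> fragments = new ArrayList<>();
        for (Element span : doc.select("span")) {
            String inner = span.html().trim();
            if (!inner.isEmpty()) {
                fragments.add(inner);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String html = "<p class='default'>"
                + "<span><b>Hello World</b></span>"
                + "<span>, Testing</span>"
                + "<span><i><b>Font </b></i></span>"
                + "<span> Style</span>"
                + "<span><i>Check</i></span>"
                + "<span></span>"
                + "</p>";
        for (String fragment : spanFragments(html)) {
            System.out.println(fragment);
        }
    }
}
```

One caveat: select("span") also matches nested spans, so for the markup in the EDITED part of the question, where spans wrap other spans, the same text would appear in more than one fragment. A narrower selector or a walk over the direct children of <p> may suit that input better.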