jsoup tag extraction problem
Question
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML tag using jsoup?
Recommended answer
Since this HTML is not semantic enough for your final purpose (a <br> cannot have children, and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction or tokenizing.
The best you can do is get the HTML content of the <div> using Jsoup and then extract the values from it using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
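import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
// How Jsoup serializes <br> (<br> or <br />) varies with version and output settings, so match both forms.
String[] parts = div.html().split("<br\\s*/?>");
for (String part : parts) {
    int colon = part.indexOf(':');
    if (colon > -1) {
        // Print the text that follows the first colon.
        System.out.println(part.substring(colon + 1).trim());
    }
}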
This results in:
example
example1
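The java.util.Scanner route mentioned above can be sketched along the same lines. This is only an illustration under a few assumptions: the class name BrValueExtractor and the method extractValues are made up for this example, and the input string is a hard-coded stand-in for what div.html() would return, so the sketch runs without Jsoup on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, not part of Jsoup.
public class BrValueExtractor {

    // Splits the inner HTML of the <div> on <br> tags and returns the text
    // that follows the first colon of each segment.
    public static List<String> extractValues(String innerHtml) {
        List<String> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(innerHtml)) {
            // Accept <br>, <br/> and <br /> as token delimiters.
            scanner.useDelimiter("<br\\s*/?>");
            while (scanner.hasNext()) {
                String part = scanner.next();
                int colon = part.indexOf(':');
                if (colon > -1) {
                    values.add(part.substring(colon + 1).trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed stand-in for div.html(); replace with the real call when using Jsoup.
        String innerHtml = "<br>test: example\n<br>test1:example1";
        for (String value : extractValues(innerHtml)) {
            System.out.println(value);
        }
    }
}
```

With the sample input this prints the same two values, example and example1. Segments without a colon are skipped, just as in the String.split version.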
If I were the HTML author, I would have used a definition list for this, e.g.:
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
// Select the <dd> elements, which hold the values.
Elements dds = document.select("#mydl dd");
for (Element dd : dds) {
    System.out.println(dd.text());
}