Not able to achieve something with Jsoup HTML parser Java
Problem description
I am not able to parse some text in the following scenarios using the Jsoup Java library.
1: This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text
Expected output: some other <b> </b> text as well <b></b>
2: This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text
Expected output: some other <b> </b> text as well <b></b>
3: This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>
Expected output: some other <b> </b> text as well <b></b>
Note that the text My Text is fixed (static), but the content of the next non-empty B tag (a tag containing only whitespace counts as empty) may vary. The regex should be able to extract the text between <b>My Text</b> and the first non-empty <b> tag that follows it.
I am using the Jsoup library but cannot produce the expected output above. The solution should work for every scenario, because the input is dynamic in my case.
A simple solution could look like this:
- find the <b> element you are interested in (the one with the text you are looking for)
- iterate over the siblings placed after it and print them until you find a non-empty <b>
You just need to remember that Jsoup uses Node to store everything in the document (including text that doesn't belong to any tag), while the Element class (which extends Node) represents only tags.
So, for instance, text like
before <b>bold</b> after<i>italic</i>
will be represented as
<node>before </node>
<element tag="B">
    <node>bold</node>
</element>
<node> after</node>
<element tag="I">
    <node>italic</node>
</element>
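To see this structure in an actual parse, a minimal sketch along these lines lists the direct children of the body. The class name NodeTreeDemo and the helper describeChildren are made up for illustration; only the Jsoup calls (Jsoup.parse, body(), childNodes(), tagName()) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

class NodeTreeDemo {
    // Hypothetical helper: one label per direct child of <body>,
    // "text" for text nodes and "element:<tag>" for tags.
    static List<String> describeChildren(String html) {
        Document doc = Jsoup.parse(html);
        List<String> labels = new ArrayList<>();
        for (Node child : doc.body().childNodes()) {
            if (child instanceof Element) {
                labels.add("element:" + ((Element) child).tagName());
            } else if (child instanceof TextNode) {
                labels.add("text");
            } else {
                // Comments and other node types fall back to nodeName().
                labels.add(child.nodeName());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints [text, element:b, text, element:i]
        System.out.println(describeChildren("before <b>bold</b> after<i>italic</i>"));
    }
}
```

The text nodes and the tags show up as siblings in one list, which is why the text between tags can be reached by walking siblings.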
So if, for instance, you select("b") (which will find <element tag="B">) and call nextElementSibling(), it will move you to <element tag="I">. To get <node> after</node> you need to use nextSibling(), which doesn't skip plain text nodes.
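The difference between the two calls can be checked with a small sketch such as the one below. SiblingDemo and its two helper methods are illustrative names, not part of Jsoup; the sketch assumes the input contains at least one <b> that has a following sibling.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class SiblingDemo {
    // Name of the next sibling that is a tag; text nodes are skipped.
    static String nextElementName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        return b.nextElementSibling().nodeName();
    }

    // Name of the very next sibling node; a text node reports "#text".
    static String nextNodeName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Node next = b.nextSibling();
        return next.nodeName();
    }

    public static void main(String[] args) {
        String html = "before <b>bold</b> after<i>italic</i>";
        System.out.println(nextElementName(html)); // prints i
        System.out.println(nextNodeName(html));    // prints #text
    }
}
```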
A possible problem with the Node class is that it doesn't provide a text() method, which would generate the textual content of the current node (and let us test whether the current node/element has any text). But nothing stops us from casting a Node that represents a tag to Element, which does provide that method.
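As a rough illustration of that cast, the check for a non-empty <b> could be isolated like this. EmptyTagCheck and isNonEmptyB are hypothetical names, and the instanceof guard is an addition that keeps the cast from failing on text nodes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class EmptyTagCheck {
    // True only when the node is a <b> tag whose text is not blank.
    static boolean isNonEmptyB(Node node) {
        if (!(node instanceof Element)) {
            return false; // text nodes, comments, etc. are never a <b>
        }
        Element el = (Element) node;
        return el.tagName().equals("b") && !el.text().trim().isEmpty();
    }

    public static void main(String[] args) {
        // Children of <body>: <b> </b>, <b></b>, <b>x</b>, and a text node.
        for (Node n : Jsoup.parse("<b> </b><b></b><b>x</b>tail").body().childNodes()) {
            System.out.println(n.nodeName() + " -> " + isNonEmptyB(n));
        }
    }
}
```

Treating a whitespace-only tag as empty matches the requirement in the question that a space should not count as a value.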
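So our solution could look like:
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public static String findFragment(String html, String fixedStart) {
    Document doc = Jsoup.parse(html);
    // Find the <b> whose text is exactly the fixed start text.
    Element myBTag = doc
            .select("b:matches(^" + Pattern.quote(fixedStart) + "$)")
            .first();
    if (myBTag == null) {
        return ""; // the start tag is not present in this input
    }
    StringBuilder sb = new StringBuilder();
    // Walk all following siblings, text nodes included.
    Node currentSibling = myBTag.nextSibling();
    while (currentSibling != null) {
        if (currentSibling.nodeName().equals("b")) {
            Element b = (Element) currentSibling;
            if (!b.text().trim().isEmpty()) {
                break; // first non-empty <b>: stop without appending it
            }
        }
        sb.append(currentSibling.toString());
        currentSibling = currentSibling.nextSibling();
    }
    return sb.toString();
}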