Not able to achieve something with Jsoup HTML parser Java


Problem description




I am not able to parse some text in the following scenarios using the Jsoup Java library.

1 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text.

Expected output : some other <b> </b> text as well <b></b>

2 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text.

Expected output : some other <b> </b> text as well <b></b>

3 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>.

Expected output : some other <b> </b> text as well <b></b>

Notice that the text My Text is fixed (static), but the value of the next non-empty B tag (whitespace alone does not count as a value) may vary. The regex should extract the text between <b>My Text</b> and the first non-empty <b> tag after it.

I am using the Jsoup library but cannot get the expected output above. The solution should work for every scenario, because the input is dynamic in my case.

Solution

A simple solution could look like this:

  • find the <b> element you are interested in (the one with the text you are looking for)
  • iterate over the siblings placed after it and print them until you find a non-empty <b>

You just need to remember that Jsoup uses Node to represent every part of the document (including text that is not itself a tag), while the Element class (which extends Node) represents only tags.

So for instance text like

before <b>bold</b> after<i>italic</i>

will be represented as

<node>before </node>
<element tag="B">
   <node>bold</node>
</element>
<node> after</node>
<element tag="I">
   <node>italic</node>
</element>

So if, for instance, you call select("b").first() (which will find <element tag="B">) and then call nextElementSibling(), it will move you to <element tag="I">. To get <node> after</node> you need to use nextSibling(), which does not skip plain text nodes.
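
The difference between the two sibling methods can be checked with a minimal sketch like the one below. It assumes Jsoup is on the classpath; the class name SiblingDemo and its two helper methods are made up for this illustration and are not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical demo class, not part of Jsoup
class SiblingDemo {

    // Name of the node directly after the first <b>; text nodes are included
    static String nextNodeName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Node next = b.nextSibling();
        return next == null ? null : next.nodeName();
    }

    // Tag name of the element directly after the first <b>; text nodes are skipped
    static String nextElementName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Element next = b.nextElementSibling();
        return next == null ? null : next.tagName();
    }

    public static void main(String[] args) {
        String html = "before <b>bold</b> after<i>italic</i>";
        System.out.println(nextNodeName(html));    // "#text": the " after" text node
        System.out.println(nextElementName(html)); // "i": the text node was skipped
    }
}
```

Jsoup reports text nodes under the node name "#text", which is why the loop in the solution can tell them apart from <b> elements by checking nodeName().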

A possible problem with the Node class is that it does not provide a text() method, which would generate the textual content of the current node (and let us test whether the current node/element has any text). But nothing stops us from casting a Node that represents a tag to Element, which provides such a method.
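
One way to do that cast safely is sketched below, assuming Jsoup is on the classpath. The class BlankCheckDemo and the helper isNonEmptyB are hypothetical names chosen for this example; only instanceof, tagName() and text() come from Java and Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical demo class, not part of Jsoup
class BlankCheckDemo {

    // True only for a <b> element whose text is not blank
    static boolean isNonEmptyB(Node node) {
        if (!(node instanceof Element)) {
            return false; // text nodes, comments, etc. are never a <b> tag
        }
        Element el = (Element) node; // safe: checked with instanceof above
        return el.tagName().equals("b") && !el.text().trim().isEmpty();
    }

    public static void main(String[] args) {
        String html = "<b> </b>x<b></b><b>full</b>";
        // Walk the direct children of <body>, text nodes included
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            System.out.println(node.nodeName() + " -> " + isNonEmptyB(node));
        }
    }
}
```

Checking with instanceof before the cast avoids a ClassCastException if the test is ever applied to a node that is not a tag.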

So our solution could look like:

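import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.regex.Pattern;

public static String findFragment(String html, String fixedStart) {

    Document doc = Jsoup.parse(html);
    // Find the first <b> whose text is exactly the fixed start text
    Element myBTag = doc
            .select("b:matches(^" + Pattern.quote(fixedStart) + "$)")
            .first();
    if (myBTag == null) {
        return ""; // no <b> with the fixed text in this input
    }

    StringBuilder sb = new StringBuilder();

    // Walk the siblings after the start tag, text nodes included
    Node currentSibling = myBTag.nextSibling();
    while (currentSibling != null) {
        if (currentSibling.nodeName().equals("b")) {
            Element b = (Element) currentSibling;
            // Stop before the first non-empty <b> so it is not part of the result
            if (!b.text().trim().isEmpty())
                break;
        }
        sb.append(currentSibling.toString());
        currentSibling = currentSibling.nextSibling();
    }

    return sb.toString();
}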
