Extract HTML from a <!-- --> comment to a closing tag using jsoup (Java)


Question

I have some HTML that looks like this:

<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>

I need to extract the HTML from the comment to a closing dl tag. The closing dl is the first one after the comment (I'm not sure whether more could follow, but there is never one before it). The HTML between the two is variable in length and content and doesn't have any good identifiers.

I see that comments themselves can be selected using #comment nodes, but how would I get the HTML starting from a comment and ending with an HTML close tag as I've described?

Here's what I've come up with. It works, but it's obviously not the most efficient approach.

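    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String myDirectoryPath = "D:\\Path";
    File dir = new File(myDirectoryPath);
    Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
    for (File child : dir.listFiles()) {
        System.out.println(child.getAbsolutePath());
        String charSet = "UTF-8";
        // Parse the file and serialize the body back to a string for regex matching
        String innerHtml = Jsoup.parse(child, charSet).select("body").html();
        Matcher m = p.matcher(innerHtml);
        if (m.find()) {
            // Re-parse the captured fragment and keep only its text
            Document doc = Jsoup.parse(m.group(1));
            String myText = doc.text();
            // Append the text to the combined output file; the writer is closed automatically
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
                out.println(myText);
            } catch (IOException e) {
                //error
            }
        }
    }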
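
For experimenting with the regex on its own, here is a minimal sketch that applies the same pattern to an in-memory string using only the standard library. The class name `StartContentRegex` and the helper `between` are hypothetical names chosen for illustration, and the sample string is made up to mirror the HTML in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class StartContentRegex {

    // Same pattern as in the question: lazily capture everything between the
    // "start content" comment and the first closing </dl> tag.
    private static final Pattern START_TO_DL = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    // Returns the captured markup, or an empty Optional when there is no match.
    static Optional<String> between(String html) {
        Matcher m = START_TO_DL.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<p>abc</p>"
                + "<!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl>"
                + "<p>def</p>";
        System.out.println(between(html).orElse("(no match)"));
    }
}
```

Note that the capture group ends just before `</dl>`, so the closing tag itself is not part of the captured text. That is harmless when only the text is needed, but it matters if the markup is reused as-is.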

Recommended answer

Here's some example code. It may need further improvement, depending on what you want to do.

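import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

final String html = "<p>abc</p>" // Additional tag before the comment
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment

// Since this is a fragment rather than a full HTML document (no head / body), you may use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());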


for( Node node : doc.childNodes() ) // Iterate over the top-level nodes of the document
{
    if( node.nodeName().equals("#comment") ) // if it's a comment we do something
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // 'toString().trim()' only tidies the output
        System.out.println("=== Childs ========");


        // Get the siblings of the comment (the list does not include the comment node itself)
        final List<Node> childNodes = node.siblingNodes();

        // Start and end index for the sublist - this is used to skip tags before the actual comment node
        final int startIdx = node.siblingIndex();   // Start index - the list excludes the comment, so this is the first node after (!) it
        final int endIdx = childNodes.size();       // End index - the last following node

        // Iterate over all nodes, following after the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example, they are only used as Element's (Html Tags)
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;

                /*
                 * Do something with your elements / nodes here ...
                 * 
                 * You can skip e.g. 'p'-tag by checking tagnames.
                 */
                System.out.println(element);

                // Stop after processing 'dl'-tag (= closing 'dl'-tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}

The code is deliberately verbose for easier understanding; you can shorten it in places.

And finally, here's the output of this example:

=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
 <dd>
  blah
 </dd>
</dl>
=== END ===========

By the way, to get the text of the comment, just cast the node to Comment:

String commentText = ((Comment) node).getData();
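
As a rough sketch of how the traversal might be condensed, the variant below finds the comment by its text and then follows `nextSibling()` until the first `dl` element, collecting the markup along the way. It assumes jsoup is on the classpath and, like the example above, that the comment is a top-level node of a fragment parsed with the XML parser. The class name `CommentToDl` and the method `extract` are hypothetical names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

// Hypothetical helper class, not part of the original answer.
class CommentToDl {

    // Returns the markup from just after the "start content" comment up to and
    // including the first following <dl> element, or "" if the comment is absent.
    static String extract(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false); // keep the markup as written

        for (Node node : doc.childNodes()) { // top-level nodes only (assumption)
            if (node instanceof Comment
                    && ((Comment) node).getData().trim().equals("start content")) {
                StringBuilder out = new StringBuilder();
                // Walk the siblings that follow the comment
                for (Node n = node.nextSibling(); n != null; n = n.nextSibling()) {
                    out.append(n.outerHtml());
                    // Stop once the first <dl> element has been appended
                    if (n instanceof Element && ((Element) n).tagName().equals("dl")) {
                        break;
                    }
                }
                return out.toString();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<p>abc</p>"
                + "<!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl>"
                + "<p>def</p>";
        System.out.println(extract(html));
    }
}
```

For a full HTML document the comment would usually be nested inside `body`, so the outer loop would need to search deeper than the top-level nodes; that case is not covered here.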
