Parsing HTML Data using Java (DOM parse)


Problem description

I've worked on this for a while and haven't found anything related on Stack Overflow. I'm using a parser to capture snippets of HTML code. With the code further below, the output file grows exponentially in size: it captures the fields (li) I need, but it is also very repetitive, capturing the same data over and over again.

Here's the file that I'm reading from (the full file actually has over 100 lines; only a short excerpt is included in this post):

<html xlmns=http://www.w3.org/1999/xhtml>
<name>Name: J0719</name>
<bracket><description>Description: <ol><li>Hop Counts: 2</li><li>State: 3</li></eol></description></bracket> 
<name>Name: J0716</name>
<bracket><description>Description: <ol><li>Hop Counts: 3</li><li>State: 2</li></eol></description></bracket> 
<name>Name: J0718</name> 
<bracket><description>Description: <ol><li>Hop Counts: 1</li><li>State: 5</li></eol></description></bracket>
<name>Name: J0726</name>
<bracket><description>Description: <ol><li>Hop Counts: 8</li><li>State: 4</li></eol></description></bracket> 
</html>

My full code is here:

package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;


import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ReadXMLFile_part2 {

public static void main(String[] args) throws Exception {

PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/results2.xml"));
System.setOut(out);

System.out.println("*** JSOUP ***");

File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/output2_TEST.html");
Document doc = null;
    try {
        doc = Jsoup.parse(input,"UTF-8", "http://www.w3.org/1999/xhtml" );
    } catch (IOException ex) {
        Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
    }
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

//For loops to capture the <li> fields in the file
Element bracket = doc.getElementsByTag("bracket").first();
Elements trs = bracket.getElementsByTag("description");
for (Element description : trs) {
    for (Element li : description.getAllElements()) {
        System.out.println(li.text());
    }
}
System.out.println();

//read a line from the console
String lineFromInput = in.readLine();

//output to the file a line
out.println(lineFromInput);                                 
out.close();    
}

}

My question is: how do I parse the fields marked by "li" in the input file so that my output file has a new line for each "li" tag? The ideal output would look like this (and would avoid the infinite loop):

Name: J0719
Hop Counts: 2
State: 3
Name: J0716
Hop Counts: 3
State: 2
Name: J0718
Hop Counts: 1
State: 5
Name: J0726
Hop Counts: 8
State: 4

Thanks, and I appreciate any help!

Sep 2nd update: previousElementSibling was useful when used alone, but I needed another nested loop of some sort when also trying to pull out the "Description" fields (otherwise previousElementSibling just kept pulling the first previous element each time). The much quicker workaround I found was to change the tags around in the original file so that it now looks like this:

Updated XML file:

<html xlmns=http://www.w3.org/1999/xhtml>
<bracket><li>Name: J0719</li>
<description>Description: <ol><li>Hop Counts 2</li><li>State: 3</li></eol></description></bracket>
<bracket><li>Name: J0716</li>
<description>Description: <ol><li>Hop Counts 3</li><li>State: 2</li></eol></description></bracket>
<bracket><li>Name: J0718</li>
<description>Description: <ol><li>Hop Counts 1</li><li>State: 5</li></eol></description></bracket>
<bracket><li>Name: J0719</li>
<description>Description: <ol><li>Hop Counts 8</li><li>State: 4</li></eol></description></bracket>
</html>

Aside from the following 'for' loops, everything else in the original code remained the same:

//Updated Code:
//For loops to capture the (li) fields in the file
Elements brackets = doc.getElementsByTag("bracket");

for (Element bracket : brackets) {
    Elements lis = bracket.select("li");

    for (Element li : lis) {
        System.out.println(li.text());
    }
    break;
}
System.out.println();

The only other issue is that I have to manually press the 'stop' button a while after starting the run, once I see that the file size has stopped growing. The output file still contains the desired results.
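One possible reason the run has to be stopped by hand is that the original main method ends with in.readLine() on System.in, which waits until a line is typed on the console. A minimal sketch, using only the standard library, of writing the result lines to a file without redirecting System.out or waiting on console input. The class name ResultWriter, the method writeLines and the temporary file are hypothetical, chosen only for illustration:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original program.
class ResultWriter {

    // Writes one entry per line; try-with-resources flushes and closes the file.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real results path (assumption).
        Path target = Files.createTempFile("results2", ".xml");
        writeLines(target, Arrays.asList("Name: J0719", "Hop Counts: 2", "State: 3"));
        System.out.println("Wrote " + Files.readAllLines(target).size() + " lines to " + target);
    }
}
```

Because nothing here reads from System.in, the program exits on its own once the file is closed.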

Recommended answer

If I understand your problem correctly, you are struggling with the fact that the name and bracket nodes in your XML are not children of a common parent node but simply follow one another. I think one way to get the correct name element when you have the bracket element is to use jsoup's DOM navigation methods, i.e. previousElementSibling().

Here is what your loop could look like:

Elements brackets = doc.getElementsByTag("bracket");
for (Element bracket : brackets) {
    // select() returns an Elements collection, not a single Element
    Elements lis = bracket.select("li");
    // the <name> element directly precedes its <bracket>
    Element name = bracket.previousElementSibling();
    System.out.println(name.text());
    for (Element li : lis) {
        System.out.println(li.text());
    }
}
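For reference, the loop above can be tried as a small self-contained program. This is only a sketch under stated assumptions: it needs the jsoup library on the classpath, it parses an inline string instead of the original file, and it assumes each list is meant to close with </ol> where the sample file has </eol> (an unmatched end tag can leave the list open, so that later elements end up nested inside it). The class name BracketLines and the method extractLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; requires jsoup on the classpath.
class BracketLines {

    // Two records from the question, with </ol> assumed in place of </eol>.
    static final String SAMPLE =
        "<html>"
        + "<name>Name: J0719</name>"
        + "<bracket><description>Description: <ol><li>Hop Counts: 2</li>"
        + "<li>State: 3</li></ol></description></bracket>"
        + "<name>Name: J0716</name>"
        + "<bracket><description>Description: <ol><li>Hop Counts: 3</li>"
        + "<li>State: 2</li></ol></description></bracket>"
        + "</html>";

    // Collects the name line followed by one line per <li>, for every <bracket>.
    static List<String> extractLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element bracket : doc.getElementsByTag("bracket")) {
            // The <name> element is expected directly before its <bracket>.
            Element name = bracket.previousElementSibling();
            if (name != null && name.tagName().equals("name")) {
                lines.add(name.text());
            }
            for (Element li : bracket.select("li")) {
                lines.add(li.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : extractLines(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The null and tag-name check on the sibling is a defensive addition, so that a bracket without a preceding name element is skipped rather than causing a NullPointerException.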
