Parsing malformed/incomplete/invalid XML files


Question

I have a process that parses an XML file using JDOM and XPath, as shown below:

private static SAXBuilder builder       = null;
private static Document   doc           = null;
private static XPath      xpathInstance = null;

builder = new SAXBuilder();
Text list = null;

// Build the JDOM document from the XML string; this fails if the XML is not well-formed.
try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}

// Evaluate the XPath expression and take the first matching text node.
try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}

The above works fine. The XPath expressions are stored in a properties file, so they can be changed at any time. Now I have to process more XML files that come from a legacy system, which only sends XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, with each chunk as one row. (Changing the legacy system, or the processing that stores the chunks as rows in the database, is out of the question.)

I can build the complete, valid XML document by extracting all the rows that belong to a specific XML document and merging them, and then use the existing processing (shown above) to parse it.
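For reference, the merge-then-parse approach described above might look roughly like the following sketch. It is not the poster's code: the class name `ChunkMerger` and its methods are hypothetical, the chunks are assumed to be already fetched and ordered (the database access is left out), and the JDK's built-in `javax.xml.xpath` API stands in for JDOM so that the example is self-contained.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ChunkMerger {

    /** Concatenates the chunks; assumes the list is already in sequence order. */
    public static String merge(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    /** Evaluates an XPath expression against a complete, well-formed XML string. */
    public static String evaluate(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for rows read from the database (hypothetical sample data).
        List<String> chunks = Arrays.asList(
                "<library><book><author>Neal Stephenson</author>",
                "<title>Snow Crash</title></book>",
                "</library>");
        String xml = merge(chunks);
        System.out.println(
                evaluate(xml, "//book[author='Neal Stephenson']/title/text()"));
    }
}
```

If the stored chunks are raw bytes in a multi-byte encoding, a chunk boundary may split a character, so it is probably safer to concatenate the bytes first and decode once.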

The thing is, though, that the data I need to extract from the XML document will always be in the first 4000 bytes. That chunk is of course not a valid XML document, since it is incomplete, but it contains all the data I need. I can't parse just that one chunk, because the JDOM builder rejects it.

I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (of which there could be quite a few) to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge hundreds of chunks only to be able to use the first 4000 bytes.

I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do both expect the XML document to be well-formed before they can parse it?

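Recommended answer

You could try using JSoup to parse the invalid XML. By definition, XML must be well-formed; input that is not well-formed is not XML, and a conforming XML parser is required to reject it, which is why the JDOM builder refuses the chunk. JSoup is a lenient parser and will build a tree from an incomplete fragment.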

Update - example:

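import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public class JsoupFragmentDemo {

    public static void main(String[] args) {
        // The input is deliberately incomplete: <test> and <author> are never closed.
        for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
                new Element(Tag.valueOf("p"), ""),
                "")) {
            print(node, 0);
        }
    }

    // Recursively prints each node's name and attributes, indented by depth.
    public static void print(Node node, int offset) {
        for (int i = 0; i < offset; i++) {
            System.out.print(" ");
        }
        System.out.print(node.nodeName());
        for (Attribute attribute : node.attributes()) {
            System.out.print(", ");
            System.out.print(attribute.getKey() + "=" + attribute.getValue());
        }
        System.out.println();
        for (Node child : node.childNodes()) {
            print(child, offset + 4);
        }
    }
}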
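Another option, not part of the original answer, is a streaming parser. A StAX reader processes the input event by event, so it can usually return data from the well-formed prefix of a truncated document and only fails once it reaches the cut-off point. The sketch below uses the JDK's `javax.xml.stream` API; the class name `TruncatedXmlReader` and the method `firstElementText` are hypothetical, and it assumes the wanted element contains only text and lies entirely inside the first chunk.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TruncatedXmlReader {

    /**
     * Returns the text of the first element with the given local name that is
     * completely contained in the (possibly truncated) input, or null if the
     * parser reaches the truncation point before finding one.
     */
    public static String firstElementText(String xml, String elementName) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTD processing is not needed for this extraction.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Reads the text up to the matching end tag.
                    return reader.getElementText();
                }
            }
        } catch (XMLStreamException e) {
            // Expected when the parser runs into the truncated tail.
        }
        return null;
    }
}
```

A call such as `TruncatedXmlReader.firstElementText(firstChunk, "title")` would then return the title text as long as the whole `<title>` element arrived in the first chunk. This does not evaluate arbitrary XPath expressions, so the expressions kept in the properties file would have to be mapped to element names or handled some other way.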
