VTD-XML seems to be spoiling escaped strings in an XML document

Question

I am working on an XML data set (the DrugBank database) in which some fields contain escaped XML characters such as &amp;.

To make the problem more concrete, here is an example scenario:

<drugs>
    <drug>
        <drugbank-id>DB00001</drugbank-id>
        <general-references>
            # Askari AT, Lincoff AM: Antithrombotic Drug Therapy in Cardiovascular Disease. 2009 Oct; pp. 440&#x2013;. ISBN 9781603272346. "Google books":http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440.
        </general-references>
        .
    </drug>
    <drug>
    ...
    </drug>
    ...
</drugs>

Since the entire document is huge, I am parsing it as follows:

VTDGen gen = new VTDGen();
try {
    gen.setDoc(Files.readAllBytes(DRUGBANK_XML));
    gen.parse(true);
} catch (IOException | ParseException e) {
    SystemHelper.exitWithMessage(e, "Unable to process Drugbank XML data. Aborting.");
}
VTDNav nav = gen.getNav();
AutoPilot pilot = new AutoPilot(nav);
pilot.selectXPath("//drugs/drug");
while (pilot.evalXPath() != -1) {
    // getContentFragment() packs the offset (lower 32 bits) and length (upper 32 bits) into one long
    long fragment = nav.getContentFragment();
    String drugXML = nav.toString((int) fragment, (int) (fragment >> 32));
    System.out.println(drugXML);
    finerParse(drugXML); // another method handling a more detailed data analysis
}

When I tested the finerParse method with sample XML (snippets copy-pasted from the same data), it worked fine. But when called from the code above, it failed with the error message "Errors in Entity: Illegal entity char". On printing the input to finerParse (i.e., the drugXML string), I noticed that the string &amp;pg=PA440 from the original XML had been changed to &pg=PA440.

Why is this happening? All I am doing is parsing it with a very well-known parser.

P.S. I have found an alternative solution: I simply pass the VTDNav as the argument to finerParse instead of first obtaining the content string and passing that string. But I am still curious about what is going wrong with the approach above.

Recommended answer

Use vtdNav.toRawString() instead of vtdNav.toString() and the problem should go away. toString() resolves entity and character references in the text it returns, whereas toRawString() leaves them exactly as they appear in the document. Let me know whether it works.
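To see why an entity-resolved fragment cannot be fed to a second XML parse, here is a minimal sketch that uses only the JDK's built-in DOM parser rather than VTD-XML. The class name EscapedFragmentDemo, the helper isWellFormed, and the dummy <root> wrapper are hypothetical names introduced for illustration; the two strings are shortened from the sample in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EscapedFragmentDemo {

    // Fragment as it appears in the source document (entity left escaped).
    static final String RAW =
        "<general-references>http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440</general-references>";

    // The same fragment after entity resolution, i.e. what the question's code printed.
    static final String RESOLVED =
        "<general-references>http://books.google.com/books?id=iadLoXoQkWEC&pg=PA440</general-references>";

    // Returns true if the fragment, wrapped in a dummy root element, parses as XML.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr; fatal errors still throw.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. a bare '&' that does not start a valid reference
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(RAW));      // prints "true"
        System.out.println(isWellFormed(RESOLVED)); // prints "false"
    }
}
```

In the question's loop, the corresponding change would presumably be to call nav.toRawString((int) fragment, (int) (fragment >> 32)) in place of nav.toString(...), assuming the offset/length overload of toRawString is available in the VTD-XML version in use.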
