Java: skip binary data in XML file while parsing


Question

I want to parse an XML file in Java that contains binary data. Here is an example of the XML file:

<?xml version="1.0" encoding="utf-8"?>
<documents>
  <document>
    <element name="docid">
      <value><![CDATA[0902307e8004c74c]]></value>
    </element>
    <element name="published">
      <value><![CDATA[2012-01-01T00:00:00]]></value>
    </element>
    <element name="documenttype">
      <value><![CDATA[Circular]]></value>
    </element>
    <element name="data">
      <value><![CDATA[%PDF-1.6
%����
1020 0 obj
<</Filter/FlateDecode/First 20/Length 270/N 3/Type/ObjStm>>stream
�o^���)|�,�Ypoef�
l���o�>����u���b"Cb�|���%&��D�yD��q�q�q�q�q��%_ja�LJob��/��3"=����o���]V11}�    }a�+'6@����C�,^}�d%�۠�`s��q��5�׷^(�N��{S<S�����A��������-������f\ڌ��|U/݌�z���f�I9����g�g���s���0z'��X~
endstream
endobj
startxref
55097
%%EOF
]]></value>
    </element>
    <element name="dataname">
      <value><![CDATA[sdfsfsfsdsdfsd.pdf]]></value>
    </element>
  </document>
</documents>

Normally I would parse such an XML file this way:

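import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Inside the parsing method; fastXMLFile is the XML file to read.
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = null;
Document doc = null;
try {
    documentBuilder = documentBuilderFactory.newDocumentBuilder();
} catch (ParserConfigurationException e) {
    e.printStackTrace();
}
try {
    // Fails with a SAXParseException when the file contains invalid XML characters
    doc = documentBuilder.parse(fastXMLFile);
} catch (SAXException e) {
    System.out.println("SAXExept");
    e.printStackTrace();
} catch (IOException e) {
    System.out.println("Test");
    return;
}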

But because the "data" element contains binary data, parsing fails with:

[Fatal Error] xmlfile.xml:58:10: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.
SAXExept
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.

I don't need to parse this data field for now, so I could just skip it. I only want to parse the rest of the data. Is this possible?

Recommended answer

Since your XML includes invalid characters (as the exception shows), you can't expect libraries to parse it successfully. Since you can't change how the XML file is created, and you can't see the code of the search engine, I believe the easiest option for you is to remove the invalid characters from the XML.

So the process would be:

1- read the contents of the XML into a String

2- scan the String and remove all invalid characters

3- write the String back into the file, or create a new file if you can't modify the original

4- parse the modified/new file.
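The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name `CleanAndParse` and its methods are made up for this example, the file is assumed to be small enough to fit in memory, and the cleaned text is parsed directly from memory instead of being written back to disk (step 3). Note that the binary payload is altered by this cleanup, which is acceptable only because the "data" field is not needed, and that a payload containing `]]>` would still break the CDATA section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for illustration only.
class CleanAndParse {

    // Matches every character outside the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final String INVALID_XML_CHARS =
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]";

    // Step 2: remove all characters that an XML 1.0 parser would reject.
    static String clean(String xml) {
        return xml.replaceAll(INVALID_XML_CHARS, "");
    }

    static Document parseLenient(Path xmlFile)
            throws IOException, ParserConfigurationException, SAXException {
        // Step 1: read the whole file as UTF-8 (the encoding declared in the example);
        // malformed byte sequences are decoded as U+FFFD, which is a legal XML character.
        String raw = new String(Files.readAllBytes(xmlFile), StandardCharsets.UTF_8);
        // A leading byte order mark would confuse a parser reading from a Reader.
        if (raw.startsWith("\uFEFF")) {
            raw = raw.substring(1);
        }
        String cleaned = clean(raw);
        // Step 4: parse the cleaned text from memory.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(cleaned)));
    }
}
```

A call such as `CleanAndParse.parseLenient(Paths.get("xmlfile.xml"))` would then return a `Document` that can be navigated as usual; writing `cleaned` to a new file first is equally possible if a cleaned copy should be kept.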

In order to replace invalid characters, see the following link, as it also includes a method to do so.

Invalid XML Characters: when valid UTF8 does not mean valid XML
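The method from the linked article is not reproduced here. As a rough stand-in, a filter can be written directly from the character ranges that the XML 1.0 specification allows. The class and method names below (`XmlCharFilter`, `isValidXmlChar`, `stripInvalidXmlChars`) are hypothetical and chosen only for this sketch.

```java
// Hypothetical helper, named for illustration only.
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input without the code points an XML 1.0 parser rejects.
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        // Iterate by code point so that surrogate pairs are kept together.
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this filter, the character from the error message (Unicode 0x1a) is dropped, while tabs, line breaks, and ordinary text pass through unchanged.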
