XML parsing error related to char encoding set


Problem description

I have a valid XML file (valid in the sense that a browser can parse it) that I am trying to parse with JDOM2. The code works for other XML files, but for this particular file it throws the following exception on the builder.build() line: "com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence."

My code is as follows:

    import java.io.*;
    import java.util.*;
    import java.net.*;
    import org.jdom2.*;
    import org.jdom2.input.*;
    import org.jdom2.output.*;
    import org.jdom2.adapters.*;

    public class Test
    {
        public static void main(String st[])
        {
            String results="N.A.";
            SAXBuilder builder = new SAXBuilder();
            Document doc;
            results = scrapeSite().trim();

                    try
                    {
                        doc = builder.build(new ByteArrayInputStream(results.getBytes()));
                    }
                    catch(JDOMException e)
                    {
                        System.out.println(e.toString());
                    }
                    catch(IOException e)
                    {
                        System.out.println(e.toString());
                    }
        }


        public static String scrapeSite()
        {
            String temp="";
            try
            {
                URL url = new URL("http://msu-footprints.org/2011/Aditya/search_5.xml");
                URLConnection conn = url.openConnection();
                conn.setAllowUserInteraction(false);
                InputStream urlStream = url.openStream();
                BufferedReader br = new BufferedReader(new InputStreamReader(urlStream));

                String t = br.readLine();
                while(t!=null)
                {
                    temp = temp + t;
                    t = br.readLine();
                }
            }
            catch(IOException e)
            {
                System.out.println(e.toString());
            }

            return temp;
        }
    }


Recommended answer

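Why are you reading the XML into a String with a Reader? You are corrupting the XML before you parse it. Treat XML as bytes, not chars. An InputStreamReader created without a charset decodes with the platform default encoding, and getBytes() without a charset re-encodes with that same default, so any bytes that do not survive the round trip no longer form the valid UTF-8 the parser expects.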

And why are you reading the whole URL InputStream just to convert it into another ByteArrayInputStream? You can reduce that to about two lines of code by passing the URL InputStream directly to the builder (not to mention avoiding the extra memory cost of reading the entire stream into memory).
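The advice above can be sketched roughly as follows. To keep the example runnable without extra dependencies, it uses the JDK's built-in DOM parser instead of JDOM2 and an in-memory byte array instead of the URL from the question. The class and method names (XmlBytesDemo, rootText, roundTrip, sampleXml) are hypothetical, and US-ASCII merely stands in for "a charset that does not match the data"; the asker's actual platform default encoding is unknown.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlBytesDemo {

    // Parse XML straight from a byte stream. The parser works out the
    // encoding itself from the BOM or the XML declaration.
    static String rootText(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Simulates the lossy step from the question: decode the bytes with a
    // charset that does not match the data, then encode them again.
    // US-ASCII is an assumption chosen only to make the loss deterministic.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.US_ASCII);
        return decoded.getBytes(StandardCharsets.US_ASCII);
    }

    // A small UTF-8 document whose text contains curly quotes, each of
    // which is a 3-byte UTF-8 sequence.
    static byte[] sampleXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><q>\u201Chi\u201D</q>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleXml();
        // Bytes handed to the parser untouched: the text survives intact.
        System.out.println(rootText(new ByteArrayInputStream(raw)));
        // Bytes pushed through a String first: the text is no longer the same.
        System.out.println(rootText(new ByteArrayInputStream(roundTrip(raw))));
    }
}
```

With JDOM2 the same idea should come down to handing the stream to the builder, for example builder.build(url.openStream()), or builder.build(url), since SAXBuilder accepts both an InputStream and a URL. Depending on which bytes the mismatched charset mangles, the result is either silently altered text, as in this sketch, or a broken multi-byte sequence like the one reported in the question.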
