Parsing invalid XML using JAXB - can the parser be more lenient?


Question

I've been using JAXB for a while now to parse xml that looks roughly like this:

<report>    <-- corresponds to a "wrapper" object that holds 
                some properties and two lists - a list of A's and list of B's
    <some tags with> general <info/>
    ...
    <A>   <-- corresponds to an "A" object with some properties
        <some tags with> info related to the <A> tag <bla/>
        ...
    </A>
    <B>   <-- corresponds to a "B" object with some properties
        <some tags with> info related to the <B> tag <bla/>
        ...
    </B>
</report>

The side responsible for marshalling the XML is terrible, but it is out of my control.
It often sends invalid XML chars and/or malformed XML.
I talked to the side responsible and got lots of errors fixed, but some they just can't seem to fix.
I want my parser to be as forgiving as possible of these errors, and when that's not possible, to get as much info as possible from the XML that contains them.
So if the XML contains 100 A's and one has a problem, I would still like to be able to keep the other 99.
These are my most common problems:

1. Some info tag inner value contains invalid chars
    <bla> invalid chars here, either control chars or just &>< </bla>
2. The root entity is missing a closing tag
    <report> ..... stuff here .... NO </report> at the end!
3. An inner entity (A/B) is missing its closing tag, or it's somehow malformed.
    <A> ...stuff here... <somethingMalformed_blabla_A/>
    OR
    <A> ...  Something malformed here...</A>

I hope I explained myself well.
I really want to get as much info as possible from these XML documents, even when they have problems.
I guess I need to employ some strategy that uses StAX/SAX along with JAXB, but I'm not sure how.
If one A out of 100 has an XML problem, I don't mind throwing away just that A.
It would be much better, though, if I could get an A object with as much data as could be parsed before the error.

Recommended answer

This answer helped me:

JAXB - unmarshal XML exception

In my case, I'm parsing results from Sysinternals Autoruns tool with the XML switch (-x). Either because the results were being written to a file share or for some buggy reason in the newer version, the XML would be malformed near the end. Since this Autoruns capture is critical for malware investigations, I really wanted the data. Plus I could tell from the file size that the results were nearly complete.

The solution in the linked question works really well when you have a document with many sub-elements, as suggested by the OP. In particular, the Autoruns XML output is really simple and consists of many "items", each consisting of many simple elements with text (i.e. String properties as generated by XJC). So if a few items are missed at the end, no big deal... unless of course it's something related to malware. :)
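To illustrate why streaming helps with a document that is cut off near the end, here is a minimal sketch that uses only StAX (no JAXB). The names `ItemSalvager`, `salvage` and `Result` are hypothetical, and the sketch assumes each `item` element contains only text, which is simpler than the real Autoruns output. Everything read before the first stream error is kept.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
public class ItemSalvager {

    /** Items read before the first error, plus that error (null if none). */
    public static final class Result {
        public final List<String> items = new ArrayList<>();
        public XMLStreamException error;
    }

    public static Result salvage(String xml) {
        Result result = new Result();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "item".equals(event.asStartElement().getName().getLocalPart())) {
                    // Assumption: <item> holds text only. getElementText()
                    // reads up to the matching </item> and throws if the
                    // document ends or breaks before it.
                    result.items.add(reader.getElementText());
                }
            }
        } catch (XMLStreamException ex) {
            // Keep what was collected so far and remember why we stopped.
            result.error = ex;
        }
        return result;
    }
}
```

Calling `ItemSalvager.salvage(...)` on a truncated document returns the complete items that precede the break, with `error` set so the caller can tell the result is partial.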

Here's my code:

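// Imports assume the JAXB 2.x (javax.xml.bind) API; newer Jakarta releases
// use the jakarta.xml.bind package instead.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class Loader {

    private List<Exception> exceptions = new ArrayList<>();

    public synchronized List<Exception> getExceptions() {
        return new ArrayList<>(exceptions);
    }

    protected void setExceptions(List<Exception> exceptions) {
        this.exceptions = exceptions;
    }

    public synchronized Autoruns load(File file, boolean attemptRecovery)
      throws LoaderException {
        Unmarshaller unmarshaller;
        try {
            JAXBContext context = JAXBContext.newInstance(Autoruns.class);
            unmarshaller = context.createUnmarshaller();
        } catch (JAXBException ex) {
            throw new LoaderException("Could not create unmarshaller.", ex);
        }
        // First try a normal unmarshal of the whole document.
        try {
            return (Autoruns) unmarshaller.unmarshal(file);
        } catch (JAXBException ex) {
            if (!attemptRecovery) {
                throw new LoaderException(ex.getMessage(), ex);
            }
        }
        // Recovery: stream the file and unmarshal one <item> at a time.
        exceptions.clear();
        Autoruns autoruns = new Autoruns();
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // try-with-resources closes the file even if parsing fails
        try (FileInputStream in = new FileInputStream(file)) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.peek();
                if (event.isStartElement() && "item".equals(
                        event.asStartElement().getName().getLocalPart())) {
                    // The try lets processing continue with the elements
                    // after this item if the item itself is malformed.
                    try {
                        JAXBElement<Autoruns.Item> element =
                          unmarshaller.unmarshal(eventReader,
                                                 Autoruns.Item.class);
                        autoruns.getItem().add(element.getValue());
                        // unmarshal() already consumed the whole <item>,
                        // so do not skip another event.
                        continue;
                    } catch (JAXBException ex) {
                        exceptions.add(ex);
                    }
                }
                // nextEvent() reports stream errors as XMLStreamException,
                // which the catch below records; Iterator.next() cannot
                // throw a checked exception.
                eventReader.nextEvent();
            }
        } catch (XMLStreamException | IOException ex) {
            exceptions.add(ex);
        }
        return autoruns;
    }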

    public static Autoruns load(Path path) throws JAXBException {
        return load(path.toFile());
    }

    public static Autoruns load(File file) throws JAXBException {
        JAXBContext context = JAXBContext.newInstance(Autoruns.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (Autoruns) unmarshaller.unmarshal(file);
    }

    public static class LoaderException extends Exception {

        public LoaderException(String message) {
            super(message);
        }

        public LoaderException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}
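The loader above recovers from malformed or truncated items, but it does not address the first problem in the question (control characters inside text). One possible pre-processing step, sketched below under the assumption that the document fits in memory as a `String`, is to drop every code point outside the XML 1.0 `Char` range before handing the text to the parser. `XmlCharFilter` and its methods are hypothetical names, and unescaped `&`, `<` or `>` are not handled here.

```java
// Hypothetical helper, for illustration only.
public class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside that range removed. */
    public static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharFilter::isValidXmlChar)
             .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

The cleaned string could then be wrapped in a `StringReader` and passed to `XMLInputFactory.createXMLEventReader`. For large files a streaming filter would be a better fit than loading the whole document.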
