What is the fastest way to programmatically check the well-formedness of XML files in C#?

Question

I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.

The XHTML files range in size from 4KB to 40KB and verifying takes several seconds per file. Checking is essential but I would like to keep the time as short as possible as the check is performed while files are being read into the next process step.

Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?

I can confirm that validating "regular" XML based content is lightning fast using the XmlReader, and as suggested the problem seems to be related to the fact that the XHTML DTD is read each time a file is validated.

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.

Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely linked to the allowed HTML entities (e.g., a &nbsp; will promptly introduce validation errors when we ignore the DTD).

The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.
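The asker's cleaned-up code does not appear here, so what follows is only a minimal sketch of the idea, not the actual solution. It assumes .NET 4.0 or later (for `DtdProcessing`) and that the DTD and the three `.ent` files have been saved under their original file names in one local directory. `LocalDtdResolver` and `CreateSettings` are hypothetical names. The post mentions embedded copies; this sketch swaps in plain files on disk, and an embedded-resource variant could return a stream from `Assembly.GetManifestResourceStream` instead.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical resolver: serves every external DTD/entity request from a local
// directory, matching on the file name at the end of the requested URI.
public sealed class LocalDtdResolver : XmlResolver
{
    private readonly string _directory;

    public LocalDtdResolver(string directory)
    {
        _directory = directory;
    }

    // Nothing is fetched over the network, so credentials are ignored.
    public override ICredentials Credentials
    {
        set { }
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // e.g. ".../DTD/xhtml1-transitional.dtd" -> "xhtml1-transitional.dtd"
        string fileName = Path.GetFileName(absoluteUri.AbsolutePath);
        string localPath = Path.Combine(_directory, fileName);
        if (!File.Exists(localPath))
        {
            // Fail loudly rather than fall back to a download.
            throw new FileNotFoundException("No local copy of " + absoluteUri, localPath);
        }
        return File.OpenRead(localPath);
    }

    // Settings that parse the DTD (so entities such as &nbsp; are known)
    // but resolve it through the local resolver only.
    public static XmlReaderSettings CreateSettings(string directory)
    {
        return new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = new LocalDtdResolver(directory)
        };
    }
}
```

The settings object would then be passed to `XmlReader.Create(stream, settings)` for each file. Matching on the file name alone is a simplification; a stricter version could key on the full URI or on the public identifier.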

I will post the solution here once I have cleaned up the code.

Recommended answer

I would expect that XmlReader with while(reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what is the input approach you are using?
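As an illustration of that read-to-the-end loop used as a well-formedness check, here is a minimal sketch. `WellFormednessCheck` and `IsWellFormed` are hypothetical names, and the default `XmlReaderSettings` are assumed unless the caller supplies others. Documents that carry a DOCTYPE, such as the XHTML files in the question, would need settings that allow DTD processing.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormednessCheck
{
    // Reads the input to the end; any XmlException means the input is not
    // well-formed under the given settings.
    public static bool IsWellFormed(TextReader input, out string error,
                                    XmlReaderSettings settings = null)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings ?? new XmlReaderSettings()))
            {
                // No work per node: the parser itself enforces well-formedness.
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            // The message includes line and position information.
            error = ex.Message;
            return false;
        }
    }
}
```

A call such as `WellFormednessCheck.IsWellFormed(new StringReader("<a><b></a>"), out error)` returns false because the `b` element is never closed.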

Do you perhaps have some external (schema etc.) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...

The following does ~300KB virtually instantly:

    // Requires: using System; using System.Diagnostics; using System.IO; using System.Xml;
    using (MemoryStream ms = new MemoryStream())
    {
        // Build a test document of roughly 300KB in memory.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.CloseOutput = false; // keep the stream open for reading afterwards
        using (XmlWriter writer = XmlWriter.Create(ms, settings))
        {
            writer.WriteStartElement("xml");
            for (int i = 0; i < 15000; i++)
            {
                writer.WriteElementString("value", i.ToString());
            }
            writer.WriteEndElement();
        }
        Console.WriteLine(ms.Length + " bytes");

        // Rewind, then time a full pass over the document.
        ms.Position = 0;
        int nodes = 0;
        Stopwatch watch = Stopwatch.StartNew();
        using (XmlReader reader = XmlReader.Create(ms))
        {
            // Read() throws XmlException if the input is not well-formed.
            while (reader.Read()) { nodes++; }
        }
        watch.Stop();
        Console.WriteLine("{0} nodes in {1}ms", nodes,
            watch.ElapsedMilliseconds);
    }
