Parsing a large XML file to multiple output XML files using XmlReader: getting every other element

Question

I need to take a very large XML file and create multiple output XML files from what could be thousands of repeating nodes in the input file. The source file, "AnimalBatch.xml", contains no whitespace and looks like this:

<?xml version="1.0" encoding="utf-8" ?><Animals><Animal id="1001"><Quantity>One</Quantity><Adjective>Red</Adjective><Name>Rooster</Name></Animal><Animal id="1002"><Quantity>Two</Quantity><Adjective>Stubborn</Adjective><Name>Donkeys</Name></Animal><Animal id="1003"><Quantity>Three</Quantity><Adjective>Blind</Adjective><Name>Mice</Name></Animal><Animal id="1004"><Quantity>Four</Quantity><Adjective>Purple</Adjective><Name>Horses</Name></Animal><Animal id="1005"><Quantity>Five</Quantity><Adjective>Long</Adjective><Name>Centipedes</Name></Animal><Animal id="1006"><Quantity>Six</Quantity><Adjective>Dark</Adjective><Name>Owls</Name></Animal></Animals>

The program needs to split on the repeating "Animal" element and produce the appropriate number of files, named Animal_1001.xml, Animal_1002.xml, Animal_1003.xml, etc.

Animal_1001.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>

Animal_1002.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>

Animal_1003.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Three</Quantity>
<Adjective>Blind</Adjective>
<Name>Mice</Name>
</Animal>

The code below works, but only if the input file has a CR/LF after the <Animal id="xxxx"> elements. If there is no whitespace (mine has none, and I can't get the file delivered that way), I get only every other one (the odd-numbered animals).

    static void SplitXMLReader()
    {
        string strFileName;
        string strSeq = "";

        XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

        while (doc.Read())
        {
            if ( doc.Name == "Animal"  && doc.NodeType == XmlNodeType.Element )
            {
                strSeq = doc.GetAttribute("id"); 

                XmlDocument outdoc = new XmlDocument();
                XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);                     
                XmlElement rootNode = outdoc.CreateElement(doc.Name);

                rootNode.InnerXml = doc.ReadInnerXml();  
                // This seems to be advancing the cursor in doc too far.

                outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                outdoc.AppendChild(rootNode);

                strFileName = "Animal_" + strSeq + ".xml";
                outdoc.Save("C:\\" + strFileName);                    
            }
        }
    }

My understanding is that whitespace or formatting in XML should make no difference to XmlReader, but I've tried this both ways, with and without CR/LFs after the <Animal id="xxxx">, and can confirm there is a difference. If the file has CR/LFs (possibly even just a space, which I'll try next), each <Animal> node is processed fully and saved under the right filename, which comes from the id attribute.

Can someone let me know what's going on here, and suggest a possible workaround?

Recommended answer

Yes, whitespace matters when you use doc.ReadInnerXml().

According to the documentation of that function, it returns a string, so of course whitespace will matter. If you want the inner content as an XmlNode, you should use something like this:
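The answer's own code sample is not preserved in this copy, so what follows is only a minimal sketch of one possible workaround, not the answerer's code. It relies on the documented behavior that ReadInnerXml() leaves the reader positioned on the node after the closing </Animal> tag. With no whitespace between elements, that node is the next <Animal> start tag, and the extra doc.Read() in the question's while loop then steps past it. The sketch calls Read() only when the current node was not consumed by ReadInnerXml(). The class name AnimalSplitter, the method Split, and its parameters are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original question or answer.
static class AnimalSplitter
{
    // Writes one Animal_<id>.xml file per <Animal> element found in the input
    // and returns the paths that were written.
    public static List<string> Split(TextReader input, string outputDir)
    {
        var written = new List<string>();

        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.Read(); // move to the first node

            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Animal")
                {
                    string id = reader.GetAttribute("id");

                    var outDoc = new XmlDocument();
                    outDoc.AppendChild(outDoc.CreateXmlDeclaration("1.0", "utf-8", null));
                    XmlElement root = outDoc.CreateElement("Animal");
                    outDoc.AppendChild(root);

                    // ReadInnerXml() already moves the reader past </Animal>,
                    // so no Read() call follows in this branch.
                    root.InnerXml = reader.ReadInnerXml();

                    string path = Path.Combine(outputDir, "Animal_" + id + ".xml");
                    outDoc.Save(path);
                    written.Add(path);
                }
                else
                {
                    // Any other node (declaration, <Animals>, whitespace, end tags):
                    // just advance.
                    reader.Read();
                }
            }
        }

        return written;
    }
}
```

Assuming the paths from the question, it could be called as `using (var input = File.OpenText(@"C:\AnimalBatch.xml")) { AnimalSplitter.Split(input, @"C:\"); }`. Because the loop no longer depends on a whitespace node sitting between elements, it should behave the same with or without CR/LFs in the input.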
