从Excel文件中替换无效的XML字符并将其写回到磁盘会导致文件损坏在MS Excel中打开时出错 [英] Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

查看:318
本文介绍了从Excel文件中替换无效的XML字符并将其写回到磁盘会导致文件损坏在MS Excel中打开时出错的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有关问题的背景知识:

我们有一个ASP.NET MVC5应用程序,我们在其中使用FlexMonster在网格中显示数据.数据源是一个存储过程,它将所有数据带入UI网格,一旦用户单击导出"按钮,它将把报告导出到Excel.但是,在某些情况下,导出到excel失败. 某些数据包含无效字符,按照建议的此处

到目前为止,我的方法:

EPPlus库无法初始化工作簿,因为输入的excel文件包含一些无效的XML字符.我发现转储文件中包含一些无效字符.我研究了可能的方法.

首先,我在excel文件中确定了有问题的字符.我首先尝试使用Notepad ++手动将无效字符替换为空格,然后EPPlus可以成功读取文件.

现在使用其他SO线程中给出的方法此处,我替换了所有可能出现的无效字符.我目前正在使用

XmlConvert.IsXmlChar

方法找出有问题的XML字符并替换为空格.

我创建了一个示例程序,试图在有问题的Excel工作表上工作.

//in main method 
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

//removal of invalid characters
        static string RemoveInvalidXmlChars(string inputText)  
        {
            StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
            int firstOccurenceOfRealData = inputText.IndexOf("<t>");
            int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");

            if (firstOccurenceOfRealData < 0 ||
                lastOccurenceOfRealData < 0 ||
                firstOccurenceOfRealData > lastOccurenceOfRealData)
                return inputText;

            withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData)); 
            int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
            string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining); 

            foreach (char c in textToCheckFor)
            {
                withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
            }
      withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));

            return withoutInvalidXmlCharsBuilder.ToString();

        }

如果我使用记事本++手动替换有问题的字符,则文件将在MSExcel中打开 fine .上面提到的代码成功替换了相同的无效字符,并将内容写回到文件中.但是,当我尝试使用MS Excel打开excel文件时,它抛出一个错误,指出该文件可能已损坏并且没有显示任何内容(下面的快照).此外,以下代码

var excelPackage = new ExcelPackage(new FileInfo(filePath));

在我通过Notepad ++更新的文件上,

引发以下异常

"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"}

我的问题:

  1. 我以这种方式修改内容的方法正确吗?
  2. 如果是,如何将更新的字符串写入Excel文件?
  3. 如果我的方法是错误的,那么如何继续摆脱无效的XML字符?

打开文件时显示错误(没有无效的XML字符):

首先弹出

当我单击是"时

提前谢谢!

解决方案

根据您的最新评论,它听起来确实像一个二进制文件(可能是XLSX).要确认,请使用7zip打开由FlexMonster创建的文件.如果它正确打开,并且您在文件夹中看到一堆XML文件,则为XLSX.

在那种情况下,在二进制文件上进行搜索/替换听起来像是一个非常糟糕的主意.它可能适用于XML部分,但也可能替代其他部分中的合法字符.我认为更好的方法是按照@PanagiotisKanavos的建议进行操作并使用ZipArchive.但是您必须按照正确的顺序重建它,否则Excel会抱怨.类似于在 https://stackoverflow.com/a/33312038/1324284 上执行的操作,您可以执行类似的操作这个:

public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go though each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();

            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);

                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);

                xdoc.Save(newstream);
            }

            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}

以这种方式使用时:

[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });

    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }

    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");

        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");

        //Now similulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();

            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}

为此:

请记住,这完全是蛮力.您可以做的所有使文件筛选器更智能的工作,而不是简单地处理所有xml文件,这将是一件非常好的事情.如果这是问题所在或工作表文件夹中的xml文件,则可以将其限制为SharedString.xml文件.在不了解更多数据的情况下很难说.

A little background on problem:

We have an ASP.NET MVC5 Application where we use FlexMonster to show the data in grid. The data source is a stored procedure that brings all the data into the UI grid, and once user clicks on export button, it exports the report to Excel. However, in some cases export to excel is failing. Some of the data has some invalid characters, and it is not possible/feasible to fix the source as suggested here

My approach so far:

EPPlus library fails on initializing the workbook as the input excel file contains some invalid XML characters. I could find that the file is dumped with some invalid character in it. I looked into the possible approaches .

Firstly, I identified the problematic character in the excel file. I first tried to replace the invalid character with blank space manually using Notepad++ and the EPPlus could successfully read the file.

Now using the approaches given in other SO thread here and here, I replaced all possible occurrences of invalid chars. I am using at the moment

XmlConvert.IsXmlChar

method to find out the problematic XML character and replacing with blank space.

I created a sample program where I am trying to work on the problematic excel sheet.

//in main method 
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

//removal of invalid characters
        static string RemoveInvalidXmlChars(string inputText)  
        {
            StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
            int firstOccurenceOfRealData = inputText.IndexOf("<t>");
            int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");

            if (firstOccurenceOfRealData < 0 ||
                lastOccurenceOfRealData < 0 ||
                firstOccurenceOfRealData > lastOccurenceOfRealData)
                return inputText;

            withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData)); 
            int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
            string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining); 

            foreach (char c in textToCheckFor)
            {
                withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
            }
      withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));

            return withoutInvalidXmlCharsBuilder.ToString();

        }

If I replaces the problematic character manually using notepad++, then the file opens fine in MSExcel. The above mentioned code successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the excel file using MS Excel, it throws an error saying that file may have been corrupted and no content is displayed (snapshots below). Moreover, Following code

var excelPackage = new ExcelPackage(new FileInfo(filePath));

on the file that I updated via Notepad++, throws following exception

"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"}

My Questions:

  1. Is my approach to modify content this way correct?
  2. If yes, How can I write updated string to an Excel file?
  3. If my approach is wrong then, How can I proceed to get rid of invalid XML chars?

Errors shown on opening file (without invalid XML char):

First Pop up

When I click on yes

Thanks in advance !

解决方案

It does sounds like a binary (presumable XLSX) file based on your last comment. To confirm, open the file created by the FlexMonster with 7zip. If it opens properly and you see a bunch of XML files in folders, its a XLSX.

In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts but might also replace legit chars in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to do rebuild it in the right order otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:

public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go though each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();

            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);

                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);

                xdoc.Save(newstream);
            }

            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}

When it is used like this:

[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });

    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }

    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");

        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");

        //Now similulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();

            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}

Gives this:

Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter rather then simply doing ALL xml files would be a very good thing. Maybe limit it to the SharedString.xml file if that is where the problem lies or in the xml files in the worksheet folders. Hard to say without knowing more about the data.

这篇关于从Excel文件中替换无效的XML字符并将其写回到磁盘会导致文件损坏在MS Excel中打开时出错的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆