OpenXML does not help to read large Excel files contrary to documentation

Question

The documentation says:

The following code segment is used to read a very large Excel 
file using the DOM approach.

and then gives an example. I used it to read a relatively large file with 700K rows. This is the code I have so far:

using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(path, false)) 
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
    WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
    SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
    // no other code
}

When I start my program, it runs out of memory (>1G) very quickly, in just five seconds. The debugger points to this line of code:

SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();

So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).

Edit

One more mystery. This is the code I have now:

OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
while (reader.Read())
{
    if (reader.ElementType == typeof(Row))
    {
        count++;
    }                                
}

leaves more than a million rows in the count variable. However, I have only 14K rows on the first sheet and 700K on the second sheet, which is very strange. So, my extra question is how to parse only rows with data using SAX approach.

There is one final mystery about reading large Excel files with OpenXML. Someone in this thread says: "Turns out that the worksheets are enumerated backwards for some reason (so the first of my three sheets is actually index 3)". So, my final extra question is how to get the sheet you want. At the moment I use this code:

WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

But taking into account what they say, I'm not sure that in my case I would actually get the first worksheet.

Recommended answer

You seem to have a few questions; I'll try to tackle them one by one.

So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).

Yes, the OpenXml SDK is great for reading large files but you may need to use a SAX approach rather than a DOM approach. From the same documentation you cite:

However, the DOM approach requires loading entire Open XML parts into memory, which can cause an Out of Memory exception when you are working with really large files.... Consider using SAX when you need to handle very large files.

The DOM approach loads the whole sheet into memory which for a large sheet can cause out of memory exceptions. Using the SAX approach you read each element in turn which reduces the memory consumption considerably.
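
To make the SAX idea concrete, here is a minimal sketch using `OpenXmlReader`. The class and method names (`SaxCellReader`, `ForEachCell`) and the callback shape are illustrative choices, not part of the SDK, and `WorksheetParts.First()` is kept only for brevity. Only one `Cell` is materialised at a time, so memory use should stay roughly flat however large the sheet is.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class SaxCellReader
{
    // Streams through one worksheet and hands each cell's reference (e.g. "A1")
    // and raw stored value to the callback, without loading the whole sheet.
    public static void ForEachCell(string path, Action<string, string> onCell)
    {
        using (SpreadsheetDocument document = SpreadsheetDocument.Open(path, false))
        {
            // Assumption: any worksheet part will do for this sketch.
            WorksheetPart worksheetPart = document.WorkbookPart.WorksheetParts.First();

            using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
            {
                while (reader.Read())
                {
                    if (reader.ElementType == typeof(Cell) && reader.IsStartElement)
                    {
                        // Loads just this cell (and its children) into memory.
                        Cell cell = (Cell)reader.LoadCurrentElement();
                        onCell(cell.CellReference?.Value, cell.CellValue?.Text);
                    }
                }
            }
        }
    }
}
```

Something like `SaxCellReader.ForEachCell(path, (reference, raw) => Console.WriteLine($"{reference}: {raw}"));` would print every cell. The raw text is whatever is stored in the cell's value element; for shared-string cells it is an index into the shared string table rather than the text itself.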

So, my extra question is how to parse only rows with data using SAX approach

You are only getting the rows that have data (or at least the rows that exist in the XML) using the SDK. You appear to have asked this as a separate question, which I've answered in more detail, but essentially you are seeing the start and end of each row element using the code in your question. See my answer to your "Why does OpenXML read rows twice" question for more details.
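
As a rough illustration of the start/end point, the counting loop from the question can be restricted to start elements so that each row is counted once. `RowCounter` and `CountRows` are names made up for this sketch; `IsStartElement` is the reader property doing the work.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class RowCounter
{
    // Counts the row elements present in the worksheet XML, once each.
    public static int CountRows(WorksheetPart worksheetPart)
    {
        int count = 0;
        using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
        {
            while (reader.Read())
            {
                // The reader reports a Row at its start tag and again at its
                // end tag; counting only the start avoids doubling the total.
                if (reader.ElementType == typeof(Row) && reader.IsStartElement)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Note that this still counts every row element stored in the file, including rows that carry only formatting and no cell values.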

So, my final extra question is how to get the sheet you want.

You need to find the Sheet by name which is a descendant of the Workbook. Once you have that you can use its Id to get the WorksheetPart:

using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filename, false))
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;

    // FirstOrDefault returns null when no sheet has that name; First() would throw instead
    Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == sheetName);
    if (sheet != null)
    {
        WorksheetPart worksheetPart = workbookPart.GetPartById(sheet.Id) as WorksheetPart;

        //read worksheetPart...
    }
}
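
If you want a sheet by its position rather than its name, the same Id lookup can be driven from the order of the `Sheet` elements in the workbook. This is a sketch under the assumption that the order of those elements matches the tab order you see in Excel, whereas `WorksheetParts` may enumerate in a different order. `SheetLocator` and `GetWorksheetPartByPosition` are hypothetical names.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class SheetLocator
{
    // Returns the worksheet part for the sheet at the given zero-based position
    // in the workbook's sheet list, or null if there is no such worksheet.
    public static WorksheetPart GetWorksheetPartByPosition(WorkbookPart workbookPart, int index)
    {
        Sheet sheet = workbookPart.Workbook
            .Descendants<Sheet>()
            .ElementAtOrDefault(index);

        if (sheet == null)
        {
            return null;
        }

        // "as" yields null if the sheet points at something other than a
        // worksheet (for example a chart sheet).
        return workbookPart.GetPartById(sheet.Id) as WorksheetPart;
    }
}
```

With this, `SheetLocator.GetWorksheetPartByPosition(workbookPart, 0)` would be the counterpart of `WorksheetParts.First()` that follows the workbook's own sheet list.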
