读通过的OpenXML一个大文件 [英] Reading a large Excel file by OpenXML

查看:249
本文介绍了读通过的OpenXML一个大文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我工作的一个项目中,我是在VB.Net使用的OpenXML SDK的Excel文件中读取一些细胞并将其存储在数据表中。 它工作得很好,为中型和小文件 然而,当,当我尝试打开大小107MB的大文件,我通过几页看完后得到一个内存不足的异常。我可以打开通过双击该文件(它需要一段时间,虽然) 下面是我使用code。请让我知道,如果我可以通过减少内存消耗提高阅读的过程

I am working on a project in which i am reading few cells from an excel file using OpenXML SDK in VB.Net and storing it in DataTable. It works just fine for the medium sized and small files however when when i try to open a large file of size 107MB, i get an OutOfMemory exception after reading through few sheets. I am able to open the file by double clicking(it takes a while though) Below are the code that i am using. Please let me know if i can improve the process of reading by reducing the memory consumption

Dim CellRage As String() = {"AG65", "AG281", "AG335", "AG389", "AG443", "AG497", "AG551", "AG800", "AG913", "AG1081", "AG1165", "AG1305"}
Dim CellValue(13) As String

Using myWorkbook As SpreadsheetDocument = SpreadsheetDocument.Open(stream, False)
     workbookPart = myWorkbook.WorkbookPart

     For Each worksheetpart As WorksheetPart In workbookPart.WorksheetParts
       For count As Integer = 0 To CellRage.GetUpperBound(0) -1
             CellValue(count) = CellValue(workbookPart, sheetName, CostCellRage(count - 2)) 
       Next
       'After few sheets throws OutofMemory Exception
     Next
End Using

Private Shared Function GetCellValue(workbookPart As WorkbookPart, sheetName As String, cellAddress As String) As String
        Dim theCell As Cell
        Dim wsPart As WorksheetPart
        Dim worksheet As Sheet
        Dim value As String
        Dim stringTablePart As SharedStringTablePart = workbookPart.SharedStringTablePart

        worksheet = workbookPart.Workbook.Descendants(Of Sheet).Where(Function(s) s.Name = sheetName).FirstOrDefault
        wsPart = CType(workbookPart.GetPartById(worksheet.Id), WorksheetPart)
        theCell = wsPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault

        If theCell.ChildElements.Count = 0 Then
            Return ""
        End If

        value = theCell.CellValue.Text

        If (theCell.DataType IsNot Nothing) AndAlso (theCell.DataType.ToString() = "s") Then
            value = stringTablePart.SharedStringTable.ChildElements(Int32.Parse(value)).InnerText
        End If

        Return value

    End Function

感谢您寻找到这

Thanks for looking into this

推荐答案

您的问题是,您code为读取每个表到内存被占用了过多内存最终导致你。

The problem you have is that your code is reading each sheet into memory which is eventually leading to you using too much memory.

随着加斯帕指出,你可以使用 SAX 的方法,而不是对的 DOM 方法您目前正在使用。要使用的OpenXML用SAX方法读取Excel文档,你可以使用 OpenXmlReader 类。这将通过处理XML的各个部分的时间,而不是整个DOM它应该让你处理大文件读取一个存储器也更有效的方式文件。

As Jesper points out you can use a SAX approach rather than the DOM approach you are currently using. To read an Excel document using OpenXML with a SAX approach you can use the OpenXmlReader class. This will read the file in a more memory efficient way by handling sections of the XML at a time rather than the whole DOM which should allow you to process large files.

OpenXmlReader 读取XML块文件部分的XML内容;把它像使用读文件。我们不能跳转到任意单元格,因为我们还没有阅读整个文档呢。相反,我们需要做的是阅读的每一行,忽略我们不希望行。一旦我们行,我们感兴趣的是,我们有整排的XML可用,因此在这一点上,我们可以直接跳转到该行中,我们感兴趣的细胞。

The OpenXmlReader reads the XML contents of the file part in XML chunks; think of it like reading a file using a Stream. We can't jump to an arbitrary cell as we haven't read the whole document yet. Instead what we need to do is read each row and ignore the rows we don't want. Once we have the row we are interested in we have the entire row XML available so at this point we can jump directly to the cells within that row that we are interested in.

Dim desiredColumnNumber As Integer = 33
Dim cellRange As Integer() = New Integer() {65, 281, 335, 389, 443, 497, _
551, 800, 913, 1081, 1165, 1305}

Using reader As OpenXmlReader = OpenXmlReader.Create(worksheetPart)
    While reader.Read()
        'we want to find the first row
        If reader.ElementType = GetType(Row) Then
            Do
                If Not cellRange.Contains(Convert.ToInt32(CUInt(row.RowIndex))) Then
                    'we're not interested in this row so continue
                    Continue Do
                End If

                Dim row As Row = DirectCast(reader.LoadCurrentElement(), Row)

                If row.HasChildren Then
                    'get cell in column AG
                    Dim cell As Cell = DirectCast(row.ChildElements(desiredColumnNumber), Cell)
                    'do something with the cell...
                End If
                'move to the next row
            Loop While reader.ReadNextSibling()
        End If
    End While
End Using

在code以上我已经分手的单元格区域成细胞数( desiredCellNumber )和整数数组 cellRange ),存储的每一行,我们感兴趣的(如果你不能分割这在原来的code,你将有数量转换的每个单元格引用正确的格式)。

In the code above I have split your cell ranges into a cell number (desiredCellNumber) and an array of Integer (cellRange) that stores the number of each row we are interested in (if you can't split this in your original code you'll have to convert each cell reference into the correct format).

之后,我们再检查 rowIndex位置属性的每一行上,看它是否存在于 cellRange 。如果没有,那么我们进入到下一行,但如果它是一排,我们有兴趣,我们可以通过访问 Row.ChildElements 属性来访问的细胞。在code以上,我们只关心列33所以我们看到使用该单元格的值 row.ChildElements(desiredColumnNumber)

After that we then check the RowIndex property on each row to see if it exists in the cellRange. If not then we move on to the next row but if it is a row we are interested we can access the cells by accessing the Row.ChildElements property. In the code above we are only interested in column 33 so we read that cell value using row.ChildElements(desiredColumnNumber).

顺便说一句,当我处理大文件这样我平时阅读 SharedStringTablePart 词典或类似的,然后再从那里读书的时候,我需要得到一个字符串值。很显然的内存量这会消耗你的Excel文件中的内容在很大程度上取决于所以你可能需要做一些聪明的,以尽量减少存储在共享字符串时所使用的内存。在code读取共享字符串(几乎)与上述相同:

As an aside, when I'm dealing with large files like this I usually read the SharedStringTablePart into a Dictionary or similar first and then read from the there when I need to get a string value. Obviously the amount of memory this consumes depends heavily on the contents of your Excel file so you might want to do something cleverer to minimise the memory used when storing the Shared Strings. The code for reading the shared strings is (almost) identical to the above:

Dim sharedStrings As New Dictionary(Of Integer, String)()
If stringTablePart IsNot Nothing Then
    Using reader As OpenXmlReader = OpenXmlReader.Create(stringTablePart)
        Dim i As Integer = 0
        While reader.Read()
            If reader.ElementType = GetType(SharedStringItem) Then
                Dim sharedStringItem As SharedStringItem = DirectCast(reader.LoadCurrentElement(), SharedStringItem)
                sharedStrings.Add(i), If(sharedStringItem.Text IsNot Nothing, sharedStringItem.Text.Text, String.Empty))
                i = i + 1
            End If
        End While
    End Using
End If

这篇关于读通过的OpenXML一个大文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆