How to read an Excel file with more than 100,000 rows in Java?

Question

I am trying to read an Excel file with more than 100,000 rows in Java using Apache POI, but I am running into a few problems.

1-) It takes 10 to 15 minutes to fetch the data from the Excel file.

2-) As soon as I run my code, my laptop starts hanging. That makes it difficult to fetch the data, and I then have to restart my laptop.

Is there another way to fetch the data from my Excel file in less time using Java?

Here is my current code:

public class ReadRfdsDump {

    public void readRfdsDump() {
        try {
            FileInputStream file = new FileInputStream(new File("C:\\Users\\esatnir\\Videos\\sprint\\sprintvision.sprint.com_Trackor Browser_RF Design Sheet_07062018122740.xlsx"));
             XSSFWorkbook workbook = new XSSFWorkbook(file);
             XSSFSheet sheet = workbook.getSheetAt(0);
             DataFormatter df = new DataFormatter();

             for(int i=0;i<2;i++) {
                 Row row= sheet.getRow(i);
                 System.out.println(df.formatCellValue(row.getCell(1)));
             }
        }catch(Exception e) {
            e.printStackTrace();
        }
    }
}

Recommended answer

By default, opening a workbook in Apache POI with WorkbookFactory.create or new XSSFWorkbook always parses the whole workbook, including all sheets. If the workbook contains a lot of data, this leads to high memory usage. Opening the workbook from a File instead of an InputStream decreases the memory usage, but it leads to other problems, because the file in use then cannot be overwritten, at least not for *.xlsx files.

There is the XSSF and SAX (Event API) approach, which gets at the underlying XML data and processes it using SAX.

But if we are already at the level where we get at the underlying XML data and process it, then we can just as well go one step further back.

A *.xlsx file is a ZIP archive that contains the data in XML files within a directory structure. So we can unzip the *.xlsx file and then get the data from the XML files.
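To make the "ZIP archive" point concrete, here is a minimal sketch that lists the entries of an archive using only java.util.zip. The class name XlsxZipListing and the helper writeDemoArchive are hypothetical; the helper writes a tiny stand-in archive with placeholder XML so the sketch runs without a real workbook. Pass the path of a real *.xlsx file as the first argument to inspect it instead.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class XlsxZipListing {

    // Returns the names of all entries in a ZIP archive (an *.xlsx file is one).
    static List<String> listEntries(Path archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            zip.stream().forEach(entry -> names.add(entry.getName()));
        }
        return names;
    }

    // Hypothetical helper: writes a small stand-in archive with placeholder parts.
    // It is NOT a valid workbook; it only mimics the directory layout.
    static Path writeDemoArchive() throws IOException {
        Path tmp = Files.createTempFile("demo", ".xlsx");
        String[] parts = {"xl/workbook.xml", "xl/sharedStrings.xml", "xl/worksheets/sheet1.xml"};
        try (OutputStream out = Files.newOutputStream(tmp);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (String name : parts) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path archive = args.length > 0 ? Paths.get(args[0]) : writeDemoArchive();
        for (String name : listEntries(archive)) {
            System.out.println(name);
        }
    }
}
```

Run against a real workbook, the listing should include the parts described in this answer, although the exact set of entries depends on the file.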

There is /xl/sharedStrings.xml, which holds all the string cell values, and /xl/workbook.xml, which describes the workbook structure. The sheets' data is stored in /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, and so on. And /xl/styles.xml holds the style settings for all cells in the sheets.

So all we need is to work with a ZIP file system in Java. This is supported by java.nio.file.FileSystems.
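As a small illustration of the ZIP file system idea, the sketch below mounts an archive with java.nio.file.FileSystems and reads one entry as text. The class name ZipFileSystemSketch and the helper writeDemoArchive are hypothetical. The helper builds a stand-in archive through the JDK's ZIP file system provider (the "jar:" URI scheme with the "create" option), so nothing here requires a real workbook.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

class ZipFileSystemSketch {

    // Mounts the archive as a file system and returns one entry's content as UTF-8 text.
    static String readEntry(Path archive, String entryPath) throws IOException {
        // The cast selects the (Path, ClassLoader) overload; a bare null is ambiguous on Java 13+.
        try (FileSystem fs = FileSystems.newFileSystem(archive, (ClassLoader) null)) {
            return new String(Files.readAllBytes(fs.getPath(entryPath)), StandardCharsets.UTF_8);
        }
    }

    // Hypothetical helper: creates a stand-in archive containing a single placeholder part.
    static Path writeDemoArchive() throws IOException {
        Path tmp = Files.createTempFile("demo", ".zip");
        Files.delete(tmp); // the ZIP provider must create the file itself; an empty file is not a valid archive
        URI uri = URI.create("jar:" + tmp.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Collections.singletonMap("create", "true"))) {
            Path part = fs.getPath("/xl/workbook.xml");
            Files.createDirectories(part.getParent());
            Files.write(part, "<workbook/>".getBytes(StandardCharsets.UTF_8));
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path archive = writeDemoArchive();
        System.out.println(readEntry(archive, "/xl/workbook.xml"));
        Files.deleteIfExists(archive);
    }
}
```

For a large sheet you would stream the entry with Files.newInputStream rather than read it fully into memory; readAllBytes is used here only to keep the sketch short.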

We also need a way to parse XML. The package javax.xml.stream is my favorite for that.
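Before the full draft, here is a minimal sketch of streaming XML parsing with javax.xml.stream, applied to a hand-written fragment shaped like a shared strings part. It uses the cursor API (XMLStreamReader) rather than the event API (XMLEventReader) used in the draft; both live in the same package. The class name StaxSharedStringsSketch and the sample XML are assumptions made for illustration. A real sharedStrings.xml declares a default namespace, which getLocalName ignores.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSharedStringsSketch {

    // Collects the text of every <si> element, one list entry per item.
    // Text from several rich-text runs inside one <si> is concatenated.
    static List<String> readItems(String xml) throws XMLStreamException {
        List<String> items = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        StringBuilder current = null; // non-null only while inside an <si> element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT && "si".equals(reader.getLocalName())) {
                current = new StringBuilder();
            } else if ((event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA)
                    && current != null) {
                current.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT && "si".equals(reader.getLocalName())) {
                items.add(current.toString());
                current = null;
            }
        }
        reader.close();
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<sst><si><t>Name</t></si><si><t>Date</t></si></sst>";
        System.out.println(readItems(xml)); // prints [Name, Date]
    }
}
```

Because the reader only moves forward and keeps no tree in memory, memory use stays roughly flat regardless of how many rows the part contains.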

The following shows a working draft. It parses /xl/sharedStrings.xml and /xl/styles.xml, but from the styles it takes only the number formats and the cell number format settings, which are essential for detecting date / time values. It then parses /xl/worksheets/sheet1.xml, which contains the data of the first sheet. To detect whether a number format is a date format, and therefore whether the formatted cell contains a date / time value, one single Apache POI class, org.apache.poi.ss.usermodel.DateUtil, is used. This is done to simplify the code. Of course, we could have coded even this class ourselves.

import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;

import javax.xml.stream.*;
import javax.xml.stream.events.*;
import javax.xml.namespace.QName;

import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
import java.util.Date;

import org.apache.poi.ss.usermodel.DateUtil;

public class UnZipAndReadXLSXFileSystem {

 public static void main (String args[]) throws Exception {

  XMLEventReader reader = null;
  XMLEvent event = null;
  Attribute attribute = null;
  StartElement startElement = null; 
  EndElement endElement = null; 

  String characters = null;
  StringBuilder stringValue = new StringBuilder(); //for collecting the characters to complete values 

  List<String> sharedStrings = new ArrayList<String>(); //list of shared strings

  Map<String, String> numberFormats = new HashMap<String, String>(); //map of number formats
  List<String> cellNumberFormats = new ArrayList<String>(); //list of cell number formats

  Path source = Paths.get("ExcelExample.xlsx"); //path to the Excel file

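  FileSystem fs = FileSystems.newFileSystem(source, (ClassLoader) null); //get filesystem of Excel file; the cast picks the ClassLoader overload, which is ambiguous with a bare null on Java 13+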

  //get shared strings ==============================================================================
  Path sharedStringsTable = fs.getPath("/xl/sharedStrings.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sharedStringsTable));
  boolean siFound = false;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("si")) {
     //start element of shared string item
     siFound = true;
     stringValue = new StringBuilder();
    } 
   } else if (event.isCharacters() && siFound) {
    //chars of the shared string item
    characters = event.asCharacters().getData();
    stringValue.append(characters);
   } else if (event.isEndElement() ) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("si")) {
     //end element of shared string item
     siFound = false;
     sharedStrings.add(stringValue.toString());
    }
   }
  }
  reader.close();
System.out.println(sharedStrings);
  //shared strings ==================================================================================

  //get styles, number formats are essential for detecting date / time values =======================
  Path styles = fs.getPath("/xl/styles.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(styles));
  boolean cellXfsFound = false;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("numFmt")) {
     //start element of number format
     attribute = startElement.getAttributeByName(new QName("numFmtId"));
     String numFmtId = attribute.getValue();
     attribute = startElement.getAttributeByName(new QName("formatCode"));
     numberFormats.put(numFmtId, ((attribute != null)?attribute.getValue():"null"));
    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
     //start element of cell format setting
     cellXfsFound = true;

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("xf") && cellXfsFound ) {
     //start element of format setting in cell format setting
     attribute = startElement.getAttributeByName(new QName("numFmtId"));
     cellNumberFormats.add(((attribute != null)?attribute.getValue():"null"));
    }
   } else if (event.isEndElement() ) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
     //end element of cell format setting
     cellXfsFound = false;
    }
   }
  }
  reader.close();
System.out.println(numberFormats);
System.out.println(cellNumberFormats);
  //styles ==========================================================================================

  //get sheet data of first sheet ===================================================================
  Path sheet1 = fs.getPath("/xl/worksheets/sheet1.xml");
  reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sheet1));
  boolean rowFound = false;
  boolean cellFound = false;
  boolean cellValueFound = false;
  boolean inlineStringFound = false; 
  String cellStyle = null;
  String cellType = null;
  while (reader.hasNext()) {
   event = (XMLEvent)reader.next();
   if (event.isStartElement()){
    startElement = (StartElement)event;
    if (startElement.getName().getLocalPart().equalsIgnoreCase("row")) {
     //start element of row
     rowFound = true;
System.out.print("<Row");

     attribute = startElement.getAttributeByName(new QName("r"));
System.out.print(" r=" + ((attribute != null)?attribute.getValue():"null"));
System.out.println(">");

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("c") && rowFound) {
     //start element of cell in row
     cellFound = true;
System.out.print("<Cell");

     attribute = startElement.getAttributeByName(new QName("r"));
System.out.print(" r=" + ((attribute != null)?attribute.getValue():"null"));

     attribute = startElement.getAttributeByName(new QName("t"));
System.out.print(" t=" + ((attribute != null)?attribute.getValue():"null"));
     cellType = ((attribute != null)?attribute.getValue():null);

     attribute = startElement.getAttributeByName(new QName("s"));
System.out.print(" s=" + ((attribute != null)?attribute.getValue():"null"));
     cellStyle = ((attribute != null)?attribute.getValue():null);

System.out.print(">");

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("v") && cellFound) {
     //start element of value in cell
     cellValueFound = true;
System.out.print("<V>");
     stringValue = new StringBuilder();

    } else if (startElement.getName().getLocalPart().equalsIgnoreCase("is") && cellFound) {
     //start element of inline string in cell
     inlineStringFound = true;
System.out.print("<Is>");
     stringValue = new StringBuilder();

    }
   } else if (event.isCharacters() && cellFound && (cellValueFound || inlineStringFound)) {
    //chars of the cell value or the inline string
    characters = event.asCharacters().getData();
    stringValue.append(characters);

   } else if (event.isEndElement()) {
    endElement = (EndElement)event;
    if (endElement.getName().getLocalPart().equalsIgnoreCase("row")) {
     //end element of row
     rowFound = false;
System.out.println("</Row>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("c")) {
     //end element of cell
     cellFound = false;
System.out.println("</Cell>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("v")) {
     //end element of value
     cellValueFound = false;

     String cellValue = stringValue.toString();

     if ("s".equals(cellType)) {
      cellValue = sharedStrings.get(Integer.valueOf(cellValue));
     }

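     if (cellStyle != null && (cellType == null || "n".equals(cellType))) { //only numeric cells can hold date / time serial values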
      int s = Integer.valueOf(cellStyle);
      String formatIndex = cellNumberFormats.get(s);
      String formatString = numberFormats.get(formatIndex);
      if (DateUtil.isADateFormat(Integer.valueOf(formatIndex), formatString)) {
       double dDate = Double.parseDouble(cellValue); 
       Date date = DateUtil.getJavaDate(dDate);
       cellValue = date.toString();
      }
     }

System.out.print(cellValue);
System.out.print("</V>");

    } else if (endElement.getName().getLocalPart().equalsIgnoreCase("is")) {
     //end element of inline string
     inlineStringFound = false;

     String cellValue = stringValue.toString();
System.out.print(cellValue);
System.out.print("</Is>");

    }
   }
  }
  reader.close();
  //sheet data ======================================================================================

  fs.close();

 }
}
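
The draft above relies on DateUtil.getJavaDate to turn a serial number into a date, and the answer notes that this part could be hand-written. As a rough sketch of what that involves, the following converts a serial value of the 1900 date system with java.time. The class name ExcelSerialDateSketch is hypothetical. The sketch assumes the 1900 date system (not the 1904 one) and serial values of 61 or more (1900-03-01 and later), where the offset caused by Excel's fictitious 1900-02-29 is constant. It does not decide whether a cell is a date at all; that still depends on the number format.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

class ExcelSerialDateSketch {

    // Day 0 of the 1900 date system, shifted by one day to absorb the non-existent 1900-02-29.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    // Converts a serial value (whole part = days since EPOCH, fraction = time of day).
    // Assumption: 1900 date system and serial >= 61.
    static LocalDateTime toDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long secondOfDay = Math.round((serial - wholeDays) * 86_400);
        return EPOCH.plusDays(wholeDays).atStartOfDay().plusSeconds(secondOfDay);
    }

    public static void main(String[] args) {
        System.out.println(toDateTime(43287.5)); // prints 2018-07-06T12:00
    }
}
```

Unlike java.util.Date, LocalDateTime carries no time zone, which matches how a date stored in a sheet has none either.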
