java: reading large file with charset

Problem description

My file is 14 GB. I would like to read it line by line and export the data to an Excel file.

As the file contains several languages, such as Chinese and English,
I tried to use FileInputStream with UTF-16 to read the data,
but this results in java.lang.OutOfMemoryError: Java heap space.
I have tried increasing the heap space, but the problem persists.
How should I change my file-reading code?

createExcel();     // open an Excel file
try {

    // works, but text in other languages is not read and output correctly
    //br = new BufferedReader(
    //        new FileReader("C:\\Users\\brian_000\\Desktop\\appdatafile.json"));

    // results in java.lang.OutOfMemoryError: Java heap space
    br = new BufferedReader(new InputStreamReader(
            new FileInputStream("C:\\Users\\brian_000\\Desktop\\appdatafile.json"),
            "UTF-16"));

} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (UnsupportedEncodingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

System.out.println("cann be print");   // this message is printed

String line;
int i = 0;
try {
    while ((line = br.readLine()) != null) {
        // process the line.
        try {
            System.out.println("cannot be print");   // this message is not printed
            // some statements for storing the data in variables.

            // a function for writing the variables into excel
            writeToExcel(platform, kind, title, shareUrl, contentRating, userRatingCount,
                    averageUserRating, marketLanguage, pricing,
                    majorVersionNumber, releaseDate, downloadsCount);
        } catch (com.google.gson.JsonSyntaxException exception) {
            System.out.println("error");
        }

        // trying to get the first 1000 rows
        i++;

        if (i == 1000) {
            br.close();
            break;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

closeExcel();




public static void writeToExcel(String platform, String kind, String title, String shareUrl,
        String contentRating, String userRatingCount, String averageUserRating,
        String marketLanguage, String pricing, String majorVersionNumber,
        String releaseDate, String downloadsCount) {

    currentRow++;
    System.out.println(currentRow);

    // start a new sheet once the current one holds 1,000,000 rows
    if (currentRow > 1000000) {
        currentsheet++;
        sheet = workbook.createSheet("apps" + currentsheet, 0);
        createFristRow();
        currentRow = 1;
    }

    try {
        // character id
        Label label = new Label(0, currentRow, String.valueOf(currentRow), cellFormat);
        sheet.addCell(label);

        // 12 similar statements that write the data to excel
        label = new Label(1, currentRow, platform, cellFormat);
        sheet.addCell(label);

    } catch (WriteException e) {
        e.printStackTrace();
    }
}


Recommended answer

Excel, UTF-16

As mentioned, the problem is most likely caused by the construction of the Excel document. Check whether UTF-8 yields a smaller size; Chinese HTML, for instance, is still more compact in UTF-8 than in UTF-16 because of the many ASCII characters.
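The size claim is easy to check on a sample of the real data. Below is a minimal sketch; the class name `EncodingSizeDemo` and the sample line are made up for illustration, and the result depends entirely on how much of the text is ASCII.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the text occupies when encoded as UTF-16 (big-endian, no BOM). */
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Hypothetical JSON-like line: mostly ASCII with a short Chinese title
        // (written with Unicode escapes so the source file stays ASCII).
        String sample = "{\"title\":\"\u5929\u6C14\",\"platform\":\"android\",\"pricing\":\"free\"}";
        System.out.println("UTF-8:  " + utf8Length(sample) + " bytes");
        System.out.println("UTF-16: " + utf16Length(sample) + " bytes");
    }
}
```

ASCII characters take one byte in UTF-8 and two in UTF-16, while common Chinese characters take three bytes in UTF-8 and two in UTF-16, so text dominated by Chinese can come out larger in UTF-8.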

Object creation in Java

You can share common small strings. This is useful for String.valueOf(row) and the like. Cache only strings of small length. I assume that cellFormat is fixed.
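One way to share small strings is a map from each short string to a single canonical instance. The sketch below is only an illustration: `SmallStringCache`, `share` and the length limit are hypothetical names and choices, not part of the asker's code or of any Excel library.

```java
import java.util.HashMap;
import java.util.Map;

class SmallStringCache {
    private final int maxLength;
    private final Map<String, String> cache = new HashMap<>();

    /** @param maxLength longest string (in chars) that will be cached */
    SmallStringCache(int maxLength) {
        this.maxLength = maxLength;
    }

    /**
     * Returns one shared instance for each distinct short string.
     * Null and long strings are returned unchanged and never stored.
     */
    String share(String s) {
        if (s == null || s.length() > maxLength) {
            return s;
        }
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held. */
    int size() {
        return cache.size();
    }
}
```

A call such as `cache.share(platform)` before building a cell lets repeated values like "android" refer to one object. The map itself grows with the number of distinct short strings, so this only pays off for values that repeat often.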

DIY with xlsx

Excel builds a costly DOM. If CSV text (with a Unicode BOM) is not an option (you could give it the extension .xls so that Excel opens it), try generating an xlsx. Create an example workbook in xlsx format. It is a zip format, which is easiest to process in Java with a zip file system. For Excel there is a content XML and a shared-strings XML; cell values are shared through an index from the content into the shared strings. No overflow happens then, because you write buffer by buffer. Or use a JDBC driver for Excel. (No recent experience on my side; maybe JDBC/ODBC.)
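The CSV route can be sketched as follows. This is a minimal illustration, not the asker's program: `CsvExport`, `export` and `quote` are hypothetical names, and each input line is written as one quoted field where the real code would write the fields parsed from the JSON. Only one line is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CsvExport {

    /** Wraps a field in double quotes and doubles any embedded quote. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /**
     * Streams the input line by line into a UTF-8 CSV that starts with a BOM.
     * Each output row holds a running row number and the raw line.
     * Note: Files.newBufferedReader throws MalformedInputException if the
     * file is not valid in the given charset.
     *
     * @return number of rows written
     */
    static long export(Path input, Charset inputCharset, Path output) throws IOException {
        long rows = 0;
        try (BufferedReader in = Files.newBufferedReader(input, inputCharset);
             BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // Unicode BOM, written once at the start of the file
            String line;
            while ((line = in.readLine()) != null) {
                rows++;
                out.write(rows + "," + quote(line));
                out.write("\r\n");
            }
        }
        return rows;
    }
}
```

It could be called as `CsvExport.export(Paths.get("appdatafile.json"), StandardCharsets.UTF_16, Paths.get("apps.csv"))`, where the charset must match the actual encoding of the input file.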

Best

Excel is hard to use with that much data. Consider investing more effort and using a database, or write every N rows into a separate, proper Excel file. Maybe you can later merge them into one document with Java. (I doubt it.)
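Splitting the output every N rows might look like the sketch below. It uses CSV part files as a stand-in for separate Excel files, since the rollover logic is the same. `RollingCsvWriter`, its methods and the `prefix-index.csv` naming scheme are all hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RollingCsvWriter implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long rowsPerFile;
    private BufferedWriter current;
    private long rowsInCurrent;
    private int fileIndex;

    RollingCsvWriter(Path dir, String prefix, long rowsPerFile) {
        if (rowsPerFile <= 0) {
            throw new IllegalArgumentException("rowsPerFile must be positive");
        }
        this.dir = dir;
        this.prefix = prefix;
        this.rowsPerFile = rowsPerFile;
    }

    /** Writes one already formatted CSV row, opening a new part file when the current one is full. */
    void writeRow(String csvRow) throws IOException {
        if (current == null || rowsInCurrent == rowsPerFile) {
            roll();
        }
        current.write(csvRow);
        current.write("\r\n");
        rowsInCurrent++;
    }

    /** Closes the current part file (if any) and opens the next one, starting with a BOM. */
    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        fileIndex++;
        Path part = dir.resolve(prefix + "-" + fileIndex + ".csv");
        current = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
        current.write('\uFEFF');
        rowsInCurrent = 0;
    }

    /** Number of part files opened so far. */
    int filesWritten() {
        return fileIndex;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

Used in a try-with-resources block, each `writeRow` call replaces one `writeToExcel` call. Memory use stays flat because no part file is kept open after it fills up.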
