What is the fastest way to get the dimensions of a CSV file in Java?


Question


My usual procedure for getting the dimensions of a CSV file is as follows:

  1. Get how many rows it has:

I use a while loop to read every line and increment a counter on each successful read. The downside is that the whole file has to be read just to count its rows.

  2. Then get how many columns it has: I use String[] temp = lineOfText.split(","); and then take the length of temp.
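The two steps above can be sketched roughly as follows. This is only an illustration of the procedure described, not code from the original post: the class name `NaiveCsvDimensions`, the `measure` helper, and the sample data are all made up, and `split(",", -1)` is used instead of a plain `split(",")` so that trailing empty fields are counted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating the line-by-line approach from the question.
class NaiveCsvDimensions {

    // Returns {rows, columns}: rows is the number of physical lines,
    // columns is the width of the widest line when split on every comma.
    static long[] measure(Reader input) {
        long rows = 0;
        long columns = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows++;
                // A limit of -1 keeps trailing empty fields such as "a,b,"
                columns = Math.max(columns, line.split(",", -1).length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {rows, columns};
    }

    public static void main(String[] args) {
        // Invented sample data: a header plus two data rows, three columns each.
        long[] dims = measure(new StringReader("id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n"));
        System.out.println("Rows: " + dims[0] + ", Columns: " + dims[1]); // Rows: 3, Columns: 3
    }
}
```

For a file on disk, a `FileReader` could be passed to `measure` instead of the `StringReader`.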

Is there a smarter method? Something like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;

Solution

Your approach won't work with multi-line values (you'll get an incorrect number of rows) or with quoted values that happen to contain the delimiter (you'll get an incorrect number of columns).
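A small demonstration of both failure modes, using only the standard library. The class `NaiveSplitPitfalls` and the sample record are invented for illustration; the input holds two records of three columns each, yet counting lines and splitting on commas reports three rows and four columns.

```java
// Hypothetical demo class; the sample data is invented for illustration.
class NaiveSplitPitfalls {

    // Counts physical lines, as a readLine() loop would.
    static int naiveRows(String csv) {
        return csv.split("\n").length;
    }

    // Width of the widest physical line when split on every comma, ignoring quotes.
    static int naiveColumns(String csv) {
        int widest = 0;
        for (String line : csv.split("\n")) {
            widest = Math.max(widest, line.split(",").length);
        }
        return widest;
    }

    public static void main(String[] args) {
        // Two records (header + one data row), three columns each.
        // The second field contains a comma, the third contains a line break.
        String csv = "id,name,note\n"
                + "1,\"Smith, John\",\"first line\nsecond line\"\n";
        System.out.println("Naive rows: " + naiveRows(csv));       // 3, not 2
        System.out.println("Naive columns: " + naiveColumns(csv)); // 4, not 3
    }
}
```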

You should use a CSV parser such as the one provided by univocity-parsers.
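To give a sense of what a parser has to track that a plain split does not, here is a minimal quote-aware counter written against the standard library only. It is a sketch under stated assumptions (comma delimiter, double-quote quoting, `""` as an escaped quote, blank lines skipped) and is not how univocity-parsers is implemented; the class name `QuoteAwareCsvDimensions` is made up. For real files a tested parser remains the safer choice.

```java
// Hypothetical sketch of quote-aware counting; not the library's implementation.
class QuoteAwareCsvDimensions {

    // Returns {records, maxColumns} for the given CSV text.
    static long[] measure(String csv) {
        long records = 0;
        long maxColumns = 0;
        long columns = 1;              // fields seen so far in the current record
        boolean inQuotes = false;      // true while inside a quoted field
        boolean recordHasData = false; // false for blank lines, which are skipped

        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                        i++;              // escaped quote: stay inside the field
                    } else {
                        inQuotes = false; // closing quote
                    }
                }
                // Any other character, including ',' and '\n', is field content.
            } else if (c == '"') {
                inQuotes = true;
                recordHasData = true;
            } else if (c == ',') {
                columns++;
                recordHasData = true;
            } else if (c == '\n') {
                if (recordHasData) {
                    records++;
                    maxColumns = Math.max(maxColumns, columns);
                }
                columns = 1;
                recordHasData = false;
            } else if (c != '\r') {
                recordHasData = true;
            }
        }
        // Account for a final record that is not terminated by a line break.
        if (recordHasData) {
            records++;
            maxColumns = Math.max(maxColumns, columns);
        }
        return new long[] {records, maxColumns};
    }

    public static void main(String[] args) {
        String csv = "id,name,note\n"
                + "1,\"Smith, John\",\"first line\nsecond line\"\n";
        long[] dims = measure(csv);
        System.out.println("Records: " + dims[0] + ", Columns: " + dims[1]); // Records: 2, Columns: 3
    }
}
```

This version holds the whole text in a String for brevity; a streaming variant would read characters from a Reader instead.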

Using the uniVocity CSV parser, the fastest way to determine the dimensions would be with the following code. It parses a 150MB file and reports its dimensions in about 1.2 seconds:

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

// Enclosing class so the snippet compiles; its name is arbitrary.
public class CsvDimensionExample {

    // Let's create our own RowProcessor to analyze the rows
    static class CsvDimension extends AbstractRowProcessor {

        // Largest row length seen so far, i.e. the column count
        int lastColumn = -1;
        long rowCount = 0;

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            rowCount++;
            if (lastColumn < row.length) {
                lastColumn = row.length;
            }
        }
    }

    public static void main(String... args) throws FileNotFoundException {
        // Let's measure the time roughly
        long start = System.currentTimeMillis();

        // Creates an instance of our own custom RowProcessor, defined above.
        CsvDimension myDimensionProcessor = new CsvDimension();

        CsvParserSettings settings = new CsvParserSettings();

        // This tells the parser that no row should have more than 2,000,000 columns
        settings.setMaxColumns(2000000);

        // Here you can select the column indexes you are interested in reading.
        // The parser will return values for the columns you selected, in the order you defined.
        // By selecting no indexes here, no String objects will be created
        settings.selectIndexes(/*nothing here*/);

        // When you select indexes, the columns are reordered so they come in the order you defined.
        // By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
        settings.setColumnReorderingEnabled(false);

        // We instruct the parser to send all rows parsed to your custom RowProcessor.
        settings.setRowProcessor(myDimensionProcessor);

        // Finally, we create a parser
        CsvParser parser = new CsvParser(settings);

        // And parse! All rows are sent to your custom RowProcessor (CsvDimension)
        // I'm using a 150MB CSV file with about 3.2 million rows.
        parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

        // Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
        System.out.println("Columns: " + myDimensionProcessor.lastColumn);
        System.out.println("Rows: " + myDimensionProcessor.rowCount);
        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }
}

The output will be:

Columns: 7
Rows: 3173959
Time taken: 1279 ms

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
