Efficiently adding huge amounts of data from CSV files into an SQLite DB in Java


Problem description

I'm trying to parse values from a CSV file into a SQLite DB; however, the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing where it was up to, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.

I'm going to have to run this program as a background process at least once a week, on a new CSV file that is around 90% similar to the previous one. I have come up with a few solutions to improve my program. However, I don't know much about databases, so I have questions about each of my solutions:

  • Is there a more efficient way to read a CSV file than what I have already?

  • Is instantiating an ObjectOutputStream and storing it as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

  • Would connection pooling, or changing the way I use the Connection in some other way, be more efficient?

  • I'm setting the URL column as UNIQUE so I can use INSERT OR IGNORE, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?

  • Are there any obvious mistakes I'm making? (Again, I know very little about databases.)

import java.io.*;
import java.sql.*;

public class Database{

// The Connection used by the methods below; its initialization is not shown in the original excerpt.
private Connection c;

public void createResultsTable(){
    Statement stmt;
    String sql = "CREATE TABLE results("
            + "ID       INTEGER     NOT NULL    PRIMARY KEY AUTOINCREMENT, "
            + "TITLE    TEXT        NOT NULL, "
            + "URL      TEXT        NOT NULL    UNIQUE, "
            ...
            ...
            + "SELLER   TEXT        NOT NULL, "
            + "BEAN     BLOB);";
    try {
        stmt = c.createStatement();
        stmt.executeUpdate(sql);
    } catch (SQLException e) { e.printStackTrace();}


}


public void addCSVToDatabase(Connection conn, String src){

    BufferedReader reader = null;
    DBEntryBean b;
    String[] vals;

    try{
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(src), "UTF-8"));
        for(String line; (line = reader.readLine()) != null;){
            //Each line takes the form: "title|URL|...|...|SELLER"
            vals = line.split("\\|"); // split() takes a regex, so '|' must be escaped

            b = new DBEntryBean();
            b.setTitle(vals[0]);
            b.setURL(vals[1]);
            ...
            ...
            b.setSeller(vals[n]);

            insert(conn, b);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try { reader.close(); }
            catch (IOException e) { e.printStackTrace(); }
        }
    }
}


public void insert(Connection conn, DBEntryBean b){

    PreparedStatement pstmt = null;
    String sql = "INSERT OR IGNORE INTO results("
            + "TITLE, "
            + "URL, "
            ...
            ...
            + "SELLER, "
            + "BEAN"
            + ");";

    try {
        pstmt = conn.prepareStatement(sql); // use the Connection passed in
        pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());      
        pstmt.setString(Constants.DB_COL_URL, b.getURL());      
        ...
        ...
        pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());

        // ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // ObjectOutputStream oos = new ObjectOutputStream(baos);
        // oos.writeObject(b);
        // byte[] bytes = baos.toByteArray();
        // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
        pstmt.executeUpdate();

    } catch (SQLException e) { e.printStackTrace(); 
    } finally{
        if(pstmt != null){
            try{ pstmt.close(); }
            catch (SQLException e) { e.printStackTrace(); }
        }

    }
}


}

Answer

The biggest bottleneck in your code is that you are not batching the insert operations. You should really call pstmt.addBatch(); instead of pstmt.executeUpdate(); and execute the batch once you have accumulated something like 10K rows to insert.
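
A minimal sketch of what this could look like, simplified to two columns (the real table has more); the 10,000-row batch size follows the answer's suggestion, while the auto-commit handling is an extra assumption that typically matters for SQLite, where every auto-committed row is its own disk sync:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {

    private static final int BATCH_SIZE = 10_000; // per the answer's suggestion

    public static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
        // Columns elided to TITLE and URL; add one '?' per real column.
        String sql = "INSERT OR IGNORE INTO results(TITLE, URL) VALUES (?, ?);";

        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit once per batch, not once per row
        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] row : rows) {
                pstmt.setString(1, row[0]);
                pstmt.setString(2, row[1]);
                pstmt.addBatch(); // queue the row instead of executing it immediately
                if (++count % BATCH_SIZE == 0) {
                    pstmt.executeBatch(); // send the queued rows in one go
                    conn.commit();
                }
            }
            pstmt.executeBatch(); // flush the final partial batch
            conn.commit();
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}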

On the CSV parsing side, you should really consider using a CSV library to do the parsing for you. univocity-parsers has the fastest CSV parser around, and it should process these 2.5 million lines in less than a second. I'm the author of this library, by the way.
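
A minimal sketch of the parsing loop with univocity-parsers (the com.univocity.parsers.csv package), assuming the pipe-delimited line format from the question; the row handling is left as a stub:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class CsvParseSketch {

    public static void parse(String src) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter('|'); // the file is pipe-delimited, as in the question

        CsvParser parser = new CsvParser(settings);
        Reader reader = new InputStreamReader(new FileInputStream(src), StandardCharsets.UTF_8);

        parser.beginParsing(reader); // the parser closes the reader once the input is consumed
        String[] row;
        while ((row = parser.parseNext()) != null) {
            // row[0] = title, row[1] = URL, ..., row[n] = seller, as in the original line format.
            // Hand each row to the batched insert instead of inserting one at a time.
        }
    }
}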

String.split() is convenient but not fast. For anything more than a few dozen rows, it doesn't make sense to use this.
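
Note also that String.split() takes a regular expression, so the question's split("|") is broken as well as slow: a bare | is the regex alternation operator and matches the empty string between every character. A quick illustration:

String line = "title|http://example.com|seller";
String[] wrong = line.split("|");   // ["t", "i", "t", "l", "e", "|", "h", ...] (splits between every character)
String[] right = line.split("\\|"); // ["title", "http://example.com", "seller"]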

Hope this helps.
