Efficiently adding huge amounts of data from CSV files into an SQLite DB in Java
Question
I'm trying to parse values from a CSV file into a SQLite DB, but the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing where it was up to, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.
I'm going to have to run this program as a background process at least once a week, on a new CSV file that is around 90% similar to the previous one. I have come up with a few solutions to improve my program; however, I don't know much about databases, so I have questions about each of them:
1. Is there a more efficient way to read a CSV file than what I have already?

2. Is instantiating an ObjectOutputStream and storing the result as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

3. Would connection pooling, or changing the way I use the Connection in some other way, be more efficient?

4. I'm setting the URL column as UNIQUE so I can use INSERT OR IGNORE, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating. Is there a faster way to add only unique values?

5. Are there any obvious mistakes I'm making? (Again, I know very little about databases.)
public class Database{

    private Connection c; // connection field, opened elsewhere in the class

    public void createResultsTable(){
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                + "TITLE TEXT NOT NULL, "
                + "URL TEXT NOT NULL UNIQUE, "
                ...
                ...
                + "SELLER TEXT NOT NULL, "
                + "BEAN BLOB);";
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) { e.printStackTrace(); }
    }
    public void addCSVToDatabase(Connection conn, String src){
        BufferedReader reader = null;
        DBEntryBean b;
        String[] vals;
        try{
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(src), "UTF-8"));
            for(String line; (line = reader.readLine()) != null;){
                // Each line takes the form: "title|URL|...|...|SELLER".
                // split() takes a regex, and '|' is a metacharacter, so it must be escaped:
                vals = line.split("\\|");
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);
                insert(conn, b);
            }
        } catch(IOException e){
            e.printStackTrace();
        }
    }
    public void insert(Connection conn, DBEntryBean b){
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN"
                + ") VALUES (?, ?, "
                ...
                + "?, ?);"; // one '?' placeholder per column listed above
        try {
            pstmt = conn.prepareStatement(sql); // use the Connection passed in, not the field
            pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
            pstmt.setString(Constants.DB_COL_URL, b.getURL());
            ...
            ...
            pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());
            // Serialize the bean into the BLOB column:
            // ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // ObjectOutputStream oos = new ObjectOutputStream(baos);
            // oos.writeObject(b);
            // byte[] bytes = baos.toByteArray();
            // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
            pstmt.executeUpdate();
        } catch (SQLException e) { e.printStackTrace();
        } finally{
            if(pstmt != null){
                try{ pstmt.close(); }
                catch (SQLException e) { e.printStackTrace(); }
            }
        }
    }
}
Answer
The biggest bottleneck in your code is that you are not batching the insert operations. You should really call pstmt.addBatch() instead of pstmt.executeUpdate(), and execute the batch once you have something like 10,000 rows queued up.
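As a rough sketch of what that could look like inside the same Database class (the BATCH_SIZE name and the three example columns are illustrative, not from the original code; wrapping the whole load in a single transaction via setAutoCommit(false) is an extra, common SQLite speed-up beyond the batching itself):

    public void addCSVToDatabase(Connection conn, String src) throws IOException, SQLException {
        final int BATCH_SIZE = 10_000; // flush the batch every 10K rows
        String sql = "INSERT OR IGNORE INTO results(TITLE, URL, SELLER) VALUES (?, ?, ?);";
        conn.setAutoCommit(false);     // one transaction for the whole load
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(src), "UTF-8"));
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            int count = 0;
            for (String line; (line = reader.readLine()) != null; ) {
                String[] vals = line.split("\\|");
                pstmt.setString(1, vals[0]); // TITLE
                pstmt.setString(2, vals[1]); // URL
                pstmt.setString(3, vals[2]); // SELLER (real code sets the remaining columns too)
                pstmt.addBatch();            // queue the row instead of executing it immediately
                if (++count % BATCH_SIZE == 0) {
                    pstmt.executeBatch();    // send the queued rows to SQLite in one go
                }
            }
            pstmt.executeBatch();            // flush the final partial batch
            conn.commit();
        } catch (IOException | SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }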
On the CSV parsing side, you should really consider using a CSV library to do the parsing for you. Univocity-parsers has the fastest CSV parser around, and it should read these 2.5 million lines in less than a second. I'm the author of this library, by the way.
String.split() is convenient but not fast: it treats its argument as a regular expression (which is also why splitting on "|" requires escaping it as "\\|"). For anything more than a few dozen rows it doesn't make sense to use it.
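A minimal sketch of the read loop with univocity-parsers, assuming the pipe-delimited format from the question (check the settings calls against the library version you actually use):

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import java.io.File;

    public class CsvLoad {
        public static void main(String[] args) {
            CsvParserSettings settings = new CsvParserSettings();
            settings.getFormat().setDelimiter('|');   // the file is pipe-delimited, not comma-delimited
            settings.getFormat().setLineSeparator("\n");

            CsvParser parser = new CsvParser(settings);
            parser.beginParsing(new File(args[0]));   // path to the CSV file
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // row[0] = title, row[1] = URL, ..., last = seller;
                // build the DBEntryBean and add it to the insert batch here.
            }
        }
    }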
Hope this helps.