Improve INSERT-per-second performance of SQLite


The Question

Optimizing SQLite is tricky. Bulk-insert performance of a C application can vary from 85 inserts per second to over 96,000 inserts per second!

Background: We are using SQLite as part of a desktop application. We have large amounts of configuration data stored in XML files that are parsed and loaded into an SQLite database for further processing when the application is initialized. SQLite is ideal for this situation because it's fast, it requires no specialized configuration, and the database is stored on disk as a single file.

Rationale: Initially I was disappointed with the performance I was seeing. It turns out that the performance of SQLite can vary significantly (both for bulk inserts and selects) depending on how the database is configured and how you're using the API. It was not a trivial matter to figure out what all of the options and techniques were, so I thought it prudent to create this community wiki entry to share the results with Stack Overflow readers and save others the trouble of the same investigations.

The Experiment: Rather than simply talking about performance tips in the general sense (i.e. "Use a transaction!"), I thought it best to write some C code and actually measure the impact of various options. We're going to start with some simple data:

  • A 28 MB TAB-delimited text file (approximately 865,000 records) of the complete transit schedule for the city of Toronto
  • My test machine is a 3.60 GHz P4 running Windows XP.
  • The code is compiled with Visual C++ 2005 as "Release" with "Full Optimization" (/Ox) and Favor Fast Code (/Ot).
  • I'm using the SQLite "Amalgamation", compiled directly into my test application. The SQLite version I happen to have is a bit older (3.6.7), but I suspect these results will be comparable to the latest release (please leave a comment if you think otherwise).

Let's write some code!

The Code: A simple C program that reads the text file line-by-line, splits the string into values and then inserts the data into an SQLite database. In this "baseline" version of the code, the database is created, but we won't actually insert data:

/*************************************************************
    Baseline code to experiment with SQLite performance.

    Input data is a 28 MB TAB-delimited text file of the
    complete Toronto Transit System schedule/route info
    from http://www.toronto.ca/open/datasets/ttc-routes/

**************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include "sqlite3.h"

#define INPUTDATA "C:\\TTC_schedule_scheduleitem_10-27-2009.txt"
#define DATABASE "c:\\TTC_schedule_scheduleitem_10-27-2009.sqlite"
#define TABLE "CREATE TABLE IF NOT EXISTS TTC (id INTEGER PRIMARY KEY, Route_ID TEXT, Branch_Code TEXT, Version INTEGER, Stop INTEGER, Vehicle_Index INTEGER, Day Integer, Time TEXT)"
#define BUFFER_SIZE 256

int main(int argc, char **argv) {

    sqlite3 * db;
    sqlite3_stmt * stmt;
    char * sErrMsg = 0;
    char * tail = 0;
    int nRetCode;
    int n = 0;

    clock_t cStartClock;

    FILE * pFile;
    char sInputBuf [BUFFER_SIZE] = "\0";

    char * sRT = 0;  /* Route */
    char * sBR = 0;  /* Branch */
    char * sVR = 0;  /* Version */
    char * sST = 0;  /* Stop Number */
    char * sVI = 0;  /* Vehicle */
    char * sDT = 0;  /* Date */
    char * sTM = 0;  /* Time */

    char sSQL [BUFFER_SIZE] = "\0";

    /*********************************************/
    /* Open the Database and create the Schema */
    sqlite3_open(DATABASE, &db);
    sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);

    /*********************************************/
    /* Open input file and import into Database*/
    cStartClock = clock();

    pFile = fopen (INPUTDATA,"r");
    while (!feof(pFile)) {

        fgets (sInputBuf, BUFFER_SIZE, pFile);

        sRT = strtok (sInputBuf, "\t");     /* Get Route */
        sBR = strtok (NULL, "\t");            /* Get Branch */
        sVR = strtok (NULL, "\t");            /* Get Version */
        sST = strtok (NULL, "\t");            /* Get Stop Number */
        sVI = strtok (NULL, "\t");            /* Get Vehicle */
        sDT = strtok (NULL, "\t");            /* Get Date */
        sTM = strtok (NULL, "\t");            /* Get Time */

        /* ACTUAL INSERT WILL GO HERE */

        n++;
    }
    fclose (pFile);

    printf("Imported %d records in %4.2f seconds\n", n, (clock() - cStartClock) / (double)CLOCKS_PER_SEC);

    sqlite3_close(db);
    return 0;
}


The "Control"

Running the code as-is doesn't actually perform any database operations, but it will give us an idea of how fast the raw C file I/O and string processing operations are.

Imported 864913 records in 0.94 seconds

Great! We can do 920,000 inserts per second, provided we don't actually do any inserts :-)

We're going to generate the SQL string using the values read from the file and invoke that SQL operation using sqlite3_exec:

sprintf(sSQL, "INSERT INTO TTC VALUES (NULL, '%s', '%s', '%s', '%s', '%s', '%s', '%s')", sRT, sBR, sVR, sST, sVI, sDT, sTM);
sqlite3_exec(db, sSQL, NULL, NULL, &sErrMsg);

This is going to be slow because the SQL will be compiled into VDBE code for every insert and every insert will happen in its own transaction. How slow?

Imported 864913 records in 9933.61 seconds

Yikes! 2 hours and 45 minutes! That's only 85 inserts per second.

By default, SQLite will evaluate every INSERT / UPDATE statement within a unique transaction. If performing a large number of inserts, it's advisable to wrap your operation in a transaction:

sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {

    ...

}
fclose (pFile);

sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);

Imported 864913 records in 38.03 seconds

That's better. Simply wrapping all of our inserts in a single transaction improved our performance to 23,000 inserts per second.

Using a transaction was a huge improvement, but recompiling the SQL statement for every insert doesn't make sense if we're using the same SQL over and over. Let's use sqlite3_prepare_v2 to compile our SQL statement once and then bind our parameters to that statement using sqlite3_bind_text:

/* Open input file and import into the database */
cStartClock = clock();

sprintf(sSQL, "INSERT INTO TTC VALUES (NULL, @RT, @BR, @VR, @ST, @VI, @DT, @TM)");
sqlite3_prepare_v2(db,  sSQL, BUFFER_SIZE, &stmt, &tail);

sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {

    fgets (sInputBuf, BUFFER_SIZE, pFile);

    sRT = strtok (sInputBuf, "\t");   /* Get Route */
    sBR = strtok (NULL, "\t");        /* Get Branch */
    sVR = strtok (NULL, "\t");        /* Get Version */
    sST = strtok (NULL, "\t");        /* Get Stop Number */
    sVI = strtok (NULL, "\t");        /* Get Vehicle */
    sDT = strtok (NULL, "\t");        /* Get Date */
    sTM = strtok (NULL, "\t");        /* Get Time */

    sqlite3_bind_text(stmt, 1, sRT, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, sBR, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 3, sVR, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 4, sST, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 5, sVI, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 6, sDT, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 7, sTM, -1, SQLITE_TRANSIENT);

    sqlite3_step(stmt);

    sqlite3_clear_bindings(stmt);
    sqlite3_reset(stmt);

    n++;
}
fclose (pFile);

sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);

printf("Imported %d records in %4.2f seconds\n", n, (clock() - cStartClock) / (double)CLOCKS_PER_SEC);

sqlite3_finalize(stmt);
sqlite3_close(db);

return 0;

Imported 864913 records in 16.27 seconds

Nice! There's a little bit more code (don't forget to call sqlite3_clear_bindings and sqlite3_reset), but we've more than doubled our performance to 53,000 inserts per second.

By default, SQLite will pause after issuing an OS-level write command. This guarantees that the data is written to the disk. By setting synchronous = OFF, we are instructing SQLite to simply hand off the data to the OS for writing and then continue. There's a chance that the database file may become corrupted if the computer suffers a catastrophic crash (or power failure) before the data is written to the platter:

/* Open the database and create the schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA synchronous = OFF", NULL, NULL, &sErrMsg);

Imported 864913 records in 12.41 seconds

The improvements are now smaller, but we're up to 69,600 inserts per second.

Consider storing the rollback journal in memory by evaluating PRAGMA journal_mode = MEMORY. Your transaction will be faster, but if you lose power or your program crashes during a transaction, your database could be left in a corrupt state with a partially-completed transaction:

/* Open the database and create the schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA journal_mode = MEMORY", NULL, NULL, &sErrMsg);

Imported 864913 records in 13.50 seconds

A little slower than the previous optimization at 64,000 inserts per second.

Let's combine the previous two optimizations. It's a little more risky (in case of a crash), but we're just importing data (not running a bank):

/* Open the database and create the schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA synchronous = OFF", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA journal_mode = MEMORY", NULL, NULL, &sErrMsg);

Imported 864913 records in 12.00 seconds

Fantastic! We're able to do 72,000 inserts per second.

Just for kicks, let's build upon all of the previous optimizations and redefine the database filename so we're working entirely in RAM:

#define DATABASE ":memory:"

Imported 864913 records in 10.94 seconds

It's not super-practical to store our database in RAM, but it's impressive that we can perform 79,000 inserts per second.

Although not specifically an SQLite improvement, I don't like the extra char* assignment operations in the while loop. Let's quickly refactor that code to pass the output of strtok() directly into sqlite3_bind_text(), and let the compiler try to speed things up for us:

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {

    fgets (sInputBuf, BUFFER_SIZE, pFile);

    sqlite3_bind_text(stmt, 1, strtok (sInputBuf, "\t"), -1, SQLITE_TRANSIENT); /* Get Route */
    sqlite3_bind_text(stmt, 2, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Branch */
    sqlite3_bind_text(stmt, 3, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Version */
    sqlite3_bind_text(stmt, 4, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Stop Number */
    sqlite3_bind_text(stmt, 5, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Vehicle */
    sqlite3_bind_text(stmt, 6, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Date */
    sqlite3_bind_text(stmt, 7, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);    /* Get Time */

    sqlite3_step(stmt);        /* Execute the SQL Statement */
    sqlite3_clear_bindings(stmt);    /* Clear bindings */
    sqlite3_reset(stmt);        /* Reset VDBE */

    n++;
}
fclose (pFile);

Note: We're back to using a real database file. In-memory databases are fast, but not necessarily practical.

Imported 864913 records in 8.94 seconds

A slight refactoring to the string processing code used in our parameter binding has allowed us to perform 96,700 inserts per second. I think it's safe to say that this is plenty fast. As we start to tweak other variables (i.e. page size, index creation, etc.) this will be our benchmark.

I hope you're still with me! The reason we started down this road is that bulk-insert performance varies so wildly with SQLite, and it's not always obvious what changes need to be made to speed up our operation. Using the same compiler (and compiler options), the same version of SQLite, and the same data, we've optimized our code and our usage of SQLite to go from a worst-case scenario of 85 inserts per second to over 96,000 inserts per second!

Before we start measuring SELECT performance, we know that we'll be creating indices. It's been suggested in one of the answers below that when doing bulk inserts, it is faster to create the index after the data has been inserted (as opposed to creating the index first then inserting the data). Let's try:

Create Index then Insert Data

sqlite3_exec(db, "CREATE  INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);
...

Imported 864913 records in 18.13 seconds

Insert Data then Create Index

...
sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "CREATE  INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);

Imported 864913 records in 13.66 seconds

As expected, bulk-inserts are slower if one column is indexed, but it does make a difference if the index is created after the data is inserted. Our no-index baseline is 96,000 inserts per second. Creating the index first then inserting data gives us 47,700 inserts per second, whereas inserting the data first then creating the index gives us 63,300 inserts per second.

I'd gladly take suggestions for other scenarios to try... And will be compiling similar data for SELECT queries soon.

The Accepted Answer

Several tips:

  1. Put inserts/updates in a transaction.
  2. For older versions of SQLite - Consider a less paranoid journal mode (pragma journal_mode). There is NORMAL, and then there is OFF, which can significantly increase insert speed if you're not too worried about the database possibly getting corrupted if the OS crashes. If your application crashes the data should be fine. Note that in newer versions, the OFF/MEMORY settings are not safe for application level crashes.
  3. Playing with page sizes makes a difference as well (PRAGMA page_size). Having larger page sizes can make reads and writes go a bit faster as larger pages are held in memory. Note that more memory will be used for your database.
  4. If you have indices, consider calling CREATE INDEX after doing all your inserts. This is significantly faster than creating the index and then doing your inserts.
  5. You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out. This has been improved somewhat with the addition of a WAL in newer SQLite versions.
  6. Take advantage of saving space...smaller databases go faster. For instance, if you have key value pairs, try making the key an INTEGER PRIMARY KEY if possible, which will replace the implied unique row number column in the table.
  7. If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.
  8. Don't use !feof(file)!

I've also asked similar questions here and here.
