How to improve SQLite insert performance in Python 3.6?


Question

Background

I would like to insert 1 million records into SQLite using Python. I have tried a number of ways to improve the speed, but I am still not satisfied. Loading the file into memory takes 0.23 seconds (search for pass below), while loading the file and inserting it into SQLite takes 1.77 seconds.

Environment

Intel Core i7-7700 @ 3.6GHz
16GB RAM
Micron 1100 256GB SSD, Windows 10 x64
Python 3.6.5 Miniconda
sqlite3.version 2.6.0

GenerateData.py

I generated 1 million rows of test input data in the same format as my real data.

import time
start_time = time.time()
with open('input.ssv', 'w') as out:
    symbols = ['AUDUSD','EURUSD','GBPUSD','NZDUSD','USDCAD','USDCHF','USDJPY','USDCNY','USDHKD']
    lines = []
    for i in range(0,1*1000*1000):
        q1, r1, q2, r2 = i//100000, i%100000, (i+1)//100000, (i+1)%100000
        line = '{} {}.{:05d} {}.{:05d}'.format(symbols[i%len(symbols)], q1, r1, q2, r2)
        lines.append(line)
    out.write('\n'.join(lines))
print(time.time()-start_time, i)

input.ssv

The test data looks like this.

AUDUSD 0.00000 0.00001
EURUSD 0.00001 0.00002
GBPUSD 0.00002 0.00003
NZDUSD 0.00003 0.00004
USDCAD 0.00004 0.00005
...
USDCHF 9.99995 9.99996
USDJPY 9.99996 9.99997
USDCNY 9.99997 9.99998
USDHKD 9.99998 9.99999
AUDUSD 9.99999 10.00000
// total of 1 million lines; the Python code took 1.38 seconds to generate and write the file

Windows correctly shows a file size of 23,999,999 bytes.

Baseline Code InsertData.py

import time
class Timer:
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed)) 

with Timer() as t:
    with open('input.ssv', 'r') as infile:
        infile.read()

Basic I/O

with open('input.ssv', 'r') as infile:
    infile.read()

Imported in 0.13 seconds or 7.6 M per second

It tests the read speed.

with open('input.ssv', 'r') as infile:
    with open('output.ssv', 'w') as outfile:
        outfile.write(infile.read())  # insert here

Imported in 0.26 seconds or 3.84 M per second

It tests the read and write speed without parsing anything.

with open('input.ssv', 'r') as infile:
    lines = infile.read().splitlines()
    for line in lines:
        pass # do insert here

Imported in 0.23 seconds or 4.32 M per second

When I iterate over the data line by line, it still achieves a very high throughput.

This gives us a sense of how fast the I/O and string-processing operations are on my test machine.

1. Write File

outfile.write(line)

Imported in 0.52 seconds or 1.93 M per second

2. Split, convert to float, and format back to string

tokens = line.split()
sym, bid, ask = tokens[0], float(tokens[1]), float(tokens[2])
outfile.write('{} {:.5f} {:.5f}\n'.format(sym, bid, ask))  # real insert here

Imported in 2.25 seconds or 445 K per second

3. Insert Statement with autocommit

conn = sqlite3.connect('example.db', isolation_level=None)
c = conn.cursor()
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))

With isolation_level = None (autocommit), the program takes many hours to complete (I could not wait that long). In autocommit mode every INSERT runs in its own transaction, so SQLite has to commit and sync to disk once per row.

Note the output database file size is 32,325,632 bytes, about 32 MB. That is roughly 8 MB larger than the 23 MB input .ssv file.

4. Insert Statement with BEGIN (DEFERRED)

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')  # the default
c = conn.cursor()
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))

Imported in 7.50 seconds or 133,296 per second

This is the same as writing BEGIN, BEGIN TRANSACTION or BEGIN DEFERRED TRANSACTION, as opposed to BEGIN IMMEDIATE or BEGIN EXCLUSIVE.
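For readers who want to see the transaction spelled out, here is a minimal sketch of the equivalent explicit form (my own illustration, not one of the timed variants above); the table definition and sample rows are included only to make it self-contained:

import sqlite3

conn = sqlite3.connect('example.db', isolation_level=None)  # autocommit mode; we manage the transaction ourselves
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS stocks (sym text, bid real, ask real)')

rows = [('AUDUSD', 0.00000, 0.00001), ('EURUSD', 0.00001, 0.00002)]  # a couple of sample rows

c.execute('BEGIN DEFERRED TRANSACTION')  # what isolation_level='DEFERRED' issues implicitly before the first INSERT
for sym, bid, ask in rows:
    c.execute("INSERT INTO stocks VALUES (?,?,?)", (sym, bid, ask))
c.execute('COMMIT')  # a single commit (and a single sync) for the whole batch
conn.close()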

5. Insert by Prepared Statement

Using the transaction above gives satisfactory results, but it should be noted that building SQL with Python string operations is undesirable because it is vulnerable to SQL injection. Moreover, string formatting is slow compared to parameter substitution.

c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(sym,bid,ask)])

Imported in 2.31 seconds or 432,124 per second
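Note that executemany() is designed to accept an iterable of many parameter tuples in one call, rather than a one-element list per row. A minimal sketch of that batched form (my own assumption, not one of the variants timed in the post), reusing input.ssv from GenerateData.py:

import sqlite3

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS stocks (sym text, bid real, ask real)')

with open('input.ssv', 'r') as infile:
    # Build one list of parameter tuples and hand the whole thing to a single executemany() call.
    params = [(t[0], float(t[1]), float(t[2])) for t in (line.split() for line in infile)]
c.executemany("INSERT INTO stocks VALUES (?,?,?)", params)
conn.commit()
conn.close()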

6. Turn off Synchronous

A power failure can corrupt the database file when synchronous is not set to EXTRA or FULL, because data may not have reached the physical disk surface yet. When we can ensure that the power supply and the OS are healthy, we can set synchronous to OFF so that SQLite does not synchronize after handing data to the OS layer.

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')

Imported in 2.25 seconds or 444,247 per second

7. Turn off the journal, so no rollback or atomic commit

In some applications the rollback capability of the database is not required, for example time-series data insertion. When we can ensure that the power supply and the OS are healthy, we can set journal_mode to OFF so that the rollback journal is disabled completely, which also disables atomic commit and rollback.

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')
c.execute('''PRAGMA journal_mode = OFF''')

Imported in 2.22 seconds or 450,653 per second

8. Using an in-memory database

In some applications writing the data back to disk is not required, for example applications that serve query results to web applications.

conn = sqlite3.connect(":memory:")

Imported in 2.17 seconds or 460,405 per second
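If the data does eventually need to land on disk, one option (my own sketch, not benchmarked in the post; the file name is arbitrary) is to build the database in memory and dump it to a file afterwards with iterdump():

import sqlite3

mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE stocks (sym text, bid real, ask real)')
mem.execute("INSERT INTO stocks VALUES ('AUDUSD', 0.0, 0.00001)")  # sample row

# iterdump() yields the SQL statements that recreate the in-memory database,
# and executescript() replays them against an on-disk file.
disk = sqlite3.connect('example_dump.db')
disk.executescript('\n'.join(mem.iterdump()))
disk.close()
mem.close()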

9. Faster Python code in the loop

We should try to save every bit of computation inside an intensive loop, for example by avoiding variable assignments and string operations.

9a. Avoid assignment to variables

tokens = line.split()
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(tokens[0], float(tokens[1]), float(tokens[2]))])

Imported in 2.10 seconds or 475,964 per second

9b. Avoid string.split()

When we can treat the space-separated data as a fixed-width format, we can index each field directly by its offset from the start of the line. That means line.split()[1] becomes line[7:14].

c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], float(line[7:14]), float(line[15:]))])

Imported in 1.94 seconds or 514,661 per second
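To make the offsets concrete, here is what the slices return for a sample line (my own illustration):

line = 'AUDUSD 0.00000 0.00001'
print(line[0:6])   # AUDUSD
print(line[7:14])  # 0.00000
print(line[15:])   # 0.00001

The open-ended final slice also copes with the last line of the test data, where the ask is the eight-character 10.00000.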

9c. Avoid float() when binding to ?

When we use executemany() with ? placeholders, we do not need to convert the strings to float beforehand; SQLite's REAL column affinity converts the numeric text for us.

c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])

Imported in 1.59 seconds or 630,520 per second
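A quick check (my own illustration, not part of the original benchmarks) confirms that the bound strings end up stored as REAL values thanks to the columns' type affinity:

import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE stocks (sym text, bid real, ask real)')
# Bind plain strings; the REAL affinity of bid/ask converts well-formed numeric text on insert.
c.execute("INSERT INTO stocks VALUES (?,?,?)", ('AUDUSD', '0.00000', '0.00001'))
print(c.execute('SELECT typeof(bid), bid, typeof(ask), ask FROM stocks').fetchone())
# -> ('real', 0.0, 'real', 1e-05)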

10. The fastest fully functional and robust code so far

import time
class Timer:    
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed))
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS stocks''')
c.execute('''CREATE TABLE IF NOT EXISTS stocks
             (sym text, bid real, ask real)''')
c.execute('''PRAGMA synchronous = EXTRA''')
c.execute('''PRAGMA journal_mode = WAL''')
with Timer() as t:
    with open('input.ssv', 'r') as infile:
        lines = infile.read().splitlines()
        for line in lines:
            c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])
        conn.commit()
        conn.close()

Imported in 1.77 seconds or 564,611 per second

Possible to get faster?

I have a 23MB file with 1 million records, each consisting of a piece of text as the symbol name and two floating-point numbers as bid and ask. If you search for pass above, the test results show 4.32 M inserts per second into a plain file. When I insert into a robust SQLite database, it drops to 0.564 M inserts per second. What else can you think of to make it even faster in SQLite? What about another database system instead of SQLite?

Solution

If Python's interpreter is actually a significant factor in the timing (section 9) versus SQLite performance, you may find that PyPy improves performance significantly (Python's sqlite3 interface is implemented in pure Python). Not much is done in pure Python here, but if you were doing more string operations or had for loops, then it is worth switching from CPython.

Obviously, if performance outside SQLite really matters, you can try writing in a faster language like C/C++. Multi-threading may or may not help, depending on how the database locks are implemented.
