How to improve SQLite insert performance in Python 3.6?
Background
I would like to insert 1 million records into SQLite using Python. I have tried a number of ways to improve it, but I am still not satisfied. Loading the file into memory takes 0.23 seconds (search for pass
below), but loading it and inserting into SQLite takes 1.77 seconds.
Environment
Intel Core i7-7700 @ 3.6GHz
16GB RAM
Micron 1100 256GB SSD, Windows 10 x64
Python 3.6.5 Miniconda
sqlite3.version 2.6.0
GenerateData.py
I generated 1 million rows of test input data in the same format as my real data.
import time
start_time = time.time()
with open('input.ssv', 'w') as out:
    symbols = ['AUDUSD','EURUSD','GBPUSD','NZDUSD','USDCAD','USDCHF','USDJPY','USDCNY','USDHKD']
    lines = []
    for i in range(0, 1*1000*1000):
        q1, r1, q2, r2 = i//100000, i%100000, (i+1)//100000, (i+1)%100000
        line = '{} {}.{:05d} {}.{:05d}'.format(symbols[i%len(symbols)], q1, r1, q2, r2)
        lines.append(line)
    out.write('\n'.join(lines))
print(time.time()-start_time, i)
input.ssv
The test data looks like this.
AUDUSD 0.00000 0.00001
EURUSD 0.00001 0.00002
GBPUSD 0.00002 0.00003
NZDUSD 0.00003 0.00004
USDCAD 0.00004 0.00005
...
USDCHF 9.99995 9.99996
USDJPY 9.99996 9.99997
USDCNY 9.99997 9.99998
USDHKD 9.99998 9.99999
AUDUSD 9.99999 10.00000
// total 1 million lines; the Python code took 1.38 seconds to generate and write them to disk
Windows correctly shows 23,999,999 bytes file size.
Baseline Code InsertData.py
import time
class Timer:
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed))
with Timer() as t:
    with open('input.ssv', 'r') as infile:
        infile.read()
Basic I/O
with open('input.ssv', 'r') as infile:
    infile.read()
Imported in 0.13 seconds or 7.6 M per second
It tests the read speed.
with open('input.ssv', 'r') as infile:
    with open('output.ssv', 'w') as outfile:
        outfile.write(infile.read())  # insert here
Imported in 0.26 seconds or 3.84 M per second
It tests the read and write speed without parsing anything.
with open('input.ssv', 'r') as infile:
    lines = infile.read().splitlines()
    for line in lines:
        pass  # do insert here
Imported in 0.23 seconds or 4.32 M per second
Even when I iterate over the data line by line, it achieves a very high throughput.
This gives us a sense of how fast the I/O and string-processing operations are on my test machine.
1. Write File
outfile.write(line)
Imported in 0.52 seconds or 1.93 M per second
2. Split, convert to float, and format back to string
tokens = line.split()
sym, bid, ask = tokens[0], float(tokens[1]), float(tokens[2])
outfile.write('{} {:.5f} {:.5f}\n'.format(sym, bid, ask))  # real insert here
Imported in 2.25 seconds or 445 K per second
3. Insert Statement with autocommit
conn = sqlite3.connect('example.db', isolation_level=None)
c = conn.cursor()
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))
With isolation_level=None (autocommit), the program takes many hours to complete (I could not wait that long).
Note the output database file is 32,325,632 bytes (32 MB), about 8 MB larger than the 23 MB input .ssv file.
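The overhead is explained by SQLite's page-based storage: a table is a B-tree of fixed-size pages (4,096 bytes by default), and each row carries a small record header on top of its raw field bytes; 32,325,632 bytes is exactly 7,892 pages of 4,096 bytes. A sketch of how to inspect this (shown against ':memory:' only so the snippet runs standalone; point it at example.db to check the real file):

```python
import sqlite3

# Point this at 'example.db' to inspect the file produced above;
# ':memory:' is used here only so the snippet runs standalone.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
page_size = c.execute('PRAGMA page_size').fetchone()[0]    # bytes per page (default 4096)
page_count = c.execute('PRAGMA page_count').fetchone()[0]  # pages allocated so far
print('{} pages x {} bytes = {} bytes'.format(page_count, page_size, page_size * page_count))
conn.close()
```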
4. Insert Statement with BEGIN (DEFERRED)
conn = sqlite3.connect('example.db', isolation_level='DEFERRED') # default
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))
Imported in 7.50 seconds or 133,296 per second
This is the same as writing BEGIN, BEGIN TRANSACTION or BEGIN DEFERRED TRANSACTION, not BEGIN IMMEDIATE nor BEGIN EXCLUSIVE.
5. Insert by Prepared Statement
Using the transaction above gives satisfactory results, but note that building SQL with Python's string operations is undesirable because it is vulnerable to SQL injection. Moreover, using strings is slow compared to parameter substitution.
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(sym,bid,ask)])
Imported in 2.31 seconds or 432,124 per second
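Note that executemany() above is invoked once per row with a one-element list, so the Python-level loop remains. It also accepts the entire dataset (a list, or any iterable of tuples) in a single call, which moves the loop into the C layer. A minimal standalone sketch with two made-up rows and the same stocks table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # ':memory:' keeps the sketch self-contained
c = conn.cursor()
c.execute('CREATE TABLE stocks (sym text, bid real, ask real)')

# In practice this would be one tuple per parsed input line.
rows = [('AUDUSD', 0.00000, 0.00001),
        ('EURUSD', 0.00001, 0.00002)]
c.executemany('INSERT INTO stocks VALUES (?,?,?)', rows)
conn.commit()
n = c.execute('SELECT COUNT(*) FROM stocks').fetchone()[0]
print(n)  # -> 2
```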
6. Turn off Synchronous
A power failure can corrupt the database file when synchronous is not set to EXTRA or FULL, because data may not yet have reached the physical disk surface. When we can ensure the power and OS are healthy, we can set synchronous to OFF so that SQLite does not synchronize after data is handed to the OS layer.
conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')
Imported in 2.25 seconds or 444,247 per second
7. Turn off the journal: no rollback or atomic commit
In some applications the rollback capability of a database is not required, for example time-series data insertion. When we can ensure the power and OS are healthy, we can set journal_mode to OFF so that the rollback journal is disabled completely, giving up atomic commit and rollback.
conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')
c.execute('''PRAGMA journal_mode = OFF''')
Imported in 2.22 seconds or 450,653 per second
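If giving up crash safety entirely feels too risky, a common middle ground (not benchmarked here) is WAL journaling with synchronous = NORMAL: commits append to a write-ahead log and fsync happens only at checkpoints. WAL requires a file-backed database, so a temporary file stands in for example.db in this sketch:

```python
import os
import sqlite3
import tempfile

# WAL mode needs a real file; a temporary path stands in for 'example.db'.
path = os.path.join(tempfile.mkdtemp(), 'example.db')
conn = sqlite3.connect(path, isolation_level='DEFERRED')
c = conn.cursor()
mode = c.execute('PRAGMA journal_mode = WAL').fetchone()[0]  # returns the mode actually set
c.execute('PRAGMA synchronous = NORMAL')  # fsync at WAL checkpoints only
print(mode)  # -> 'wal'
conn.close()
```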
8. Using in-memory database
In some applications writing data back to disks is not required, such as applications providing queried data to web applications.
conn = sqlite3.connect(":memory:")
Imported in 2.17 seconds or 460,405 per second
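If the data is built in memory but eventually needs to be persisted, the connection's iterdump() method (available in Python 3.6) yields SQL statements that recreate the whole database; executing them on a file-backed connection writes everything out. A sketch, with the target shown as a second in-memory database only so it runs standalone:

```python
import sqlite3

mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE stocks (sym text, bid real, ask real)')
mem.execute("INSERT INTO stocks VALUES ('AUDUSD', 0.00000, 0.00001)")

# iterdump() yields CREATE/INSERT statements reproducing the whole database;
# connect to a real path such as 'example.db' to actually persist it.
disk = sqlite3.connect(':memory:')
disk.executescript('\n'.join(mem.iterdump()))
n = disk.execute('SELECT COUNT(*) FROM stocks').fetchone()[0]
print(n)  # -> 1
```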
9. Faster Python code in the loop
We should try to shave every bit of computation inside an intensive loop, for example by avoiding variable assignments and string operations.
9a. Avoid assignment to variable
tokens = line.split()
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(tokens[0], float(tokens[1]), float(tokens[2]))])
Imported in 2.10 seconds or 475,964 per second
9b. Avoid string.split()
When we can treat the space-separated data as a fixed-width format, we can index each field directly by its offset from the start of the line.
That means line.split()[1]
becomes line[7:14]
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], float(line[7:14]), float(line[15:]))])
Imported in 1.94 seconds or 514,661 per second
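The slicing relies on the generated format: every symbol is exactly 6 characters and the bid always occupies columns 7-13, while the open-ended line[15:] tolerates the ask growing to 10.00000. A quick sanity check of that assumption on one row of the test data:

```python
line = 'AUDUSD 0.00000 0.00001'  # one row of the generated test data
assert line[0:6] == line.split()[0]   # symbol
assert line[7:14] == line.split()[1]  # bid
assert line[15:] == line.split()[2]   # ask (open-ended slice, so width may grow)
print('fixed-width slicing matches split() for this format')
```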
9c. Avoid float() before binding with ?
When we use executemany()
with ?
placeholders, we don't need to turn the strings into floats beforehand; the REAL column affinity lets SQLite store the numeric text as floating point.
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])
Imported in 1.59 seconds or 630,520 per second
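Combining 9b and 9c with a single executemany() call removes the remaining per-line Python call overhead: executemany() accepts a generator, so the whole file can be streamed through one call. A standalone sketch with two hard-coded lines standing in for the file contents:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the real target database
c = conn.cursor()
c.execute('CREATE TABLE stocks (sym text, bid real, ask real)')

lines = ['AUDUSD 0.00000 0.00001',  # in the real code: infile.read().splitlines()
         'EURUSD 0.00001 0.00002']
c.executemany('INSERT INTO stocks VALUES (?,?,?)',
              ((line[0:6], line[7:14], line[15:]) for line in lines))
conn.commit()
n = c.execute('SELECT COUNT(*) FROM stocks').fetchone()[0]
print(n)  # -> 2
```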
10. The fastest fully functional and robust code so far
import time
class Timer:
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed))
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS stocks''')
c.execute('''CREATE TABLE IF NOT EXISTS stocks
             (sym text, bid real, ask real)''')
c.execute('''PRAGMA synchronous = EXTRA''')
c.execute('''PRAGMA journal_mode = WAL''')
with Timer() as t:
    with open('input.ssv', 'r') as infile:
        lines = infile.read().splitlines()
        for line in lines:
            c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])
        conn.commit()
conn.close()
Imported in 1.77 seconds or 564,611 per second
Possible to get faster?
I have a 23 MB file with 1 million records, each consisting of a piece of text as the symbol name and two floating-point numbers as bid and ask. When you search for pass
above, the test results show 4.32 M inserts per second to a plain file. When I insert into a robust SQLite database, it drops to 0.564 M inserts per second. What else can you think of to make it even faster in SQLite? What if it were not SQLite but another database system?
If Python's interpreter is actually a significant factor in the timing (section 9) versus SQLite performance, you may find that PyPy improves performance significantly (PyPy's sqlite3 interface is implemented in pure Python). Not much here is done in pure Python, but if you were doing more string operations or had for loops, then it would be worth switching from CPython.
Obviously, if performance outside SQLite really matters, you can try rewriting in a faster language like C/C++. Multi-threading may or may not help, depending on how the database locks are implemented.