如何在快速时间范围查询中将时间序列与Sqlite一起使用? [英] How to use time-series with Sqlite, with fast time-range queries?

查看:82
本文介绍了如何在快速时间范围查询中将时间序列与Sqlite一起使用?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我们将事件记录在带有Unix时间戳列 ts 的Sqlite数据库中:

Let's say we log events in a Sqlite database with Unix timestamp column ts:

CREATE TABLE data(ts INTEGER, text TEXT);   -- more columns in reality

,我们希望快速查找日期时间范围,例如:

and that we want fast lookup for datetime ranges, for example:

SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;

像这样, EXPLAIN QUERY PLAN 给出了 SCAN TABLE数据,这很糟糕,因此一种显而易见的解决方案是使用 创建索引>创建索引dt_idx ON数据(ts).

Like this, EXPLAIN QUERY PLAN gives SCAN TABLE data which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).

然后问题解决了,但是对于必须为已经增加的序列/已经排序的列 ts 维护索引是一个较差的解决方案可以直接在O(log n)中使用 B树搜索.在内部,这将是索引:

Then the problem is solved, but it's rather a poor solution to have to maintain an index for an already-increasing sequence / already-sorted column ts for which we could use a B-tree search in O(log n) directly. Internally this will be the index:

ts           rowid
1608000001   1
1608000002   2
1608000012   3
1608000077   4

这是数据库空间的浪费(当查询必须首先在索引中查找时,这是CPU的浪费).

which is a waste of DB space (and CPU when a query has to look in the index first).

为避免这种情况:

  • (1)我们可以使用 ts 作为 INTEGER PRIMARY KEY ,因此 ts 将是 rowid 本身.但这失败了,因为 ts 不是唯一的:2个事件可以在同一秒(甚至在同一毫秒)发生.

  • (1) we could use ts as INTEGER PRIMARY KEY, so ts would be the rowid itself. But this fails because ts is not unique: 2 events can happen at the same second (or even at the same millisecond).

例如,请参见 SQLite自动增量中给出的信息.

See for example the info given in SQLite Autoincrement.

(2)我们可以使用 rowid 作为时间戳,将 ts 串联起来,并增加数量.示例:

(2) we could use rowid as timestamp ts concatenated with an increasing number. Example:

 16087186540001      
 16087186540002
 [--------][--]
     ts     increasing number 

然后 rowid 是唯一的并且严格增加(假设每秒少于1万个事件),并且不需要索引.一个查询 a AND b之间的ts 会简单地变成 a * 10000 AND b * 10000 + 9999 之间的whereid.

Then rowid is unique and strictly increasing (provided there are less than 10k events per second), and no index would be required. A query WHERE ts BETWEEN a AND b would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999.

但是有没有一种简单的方法让Sqlite对 rowid 大于或等于给定值的项目进行 INSERT ?假设当前时间戳为 1608718654 ,并且出现了两个事件:

But is there an easy way to ask Sqlite to INSERT an item with a rowid greater than or equal to a given value? Let's say the current timestamp is 1608718654 and two events appear:

  CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
  INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello")  #16087186540001 
  INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello")  #16087186540002

更一般地说,如何使用Sqlite优化创建时间序列以进行快速查询 a和b之间的时间戳?

More generally, how to create time-series optimally with Sqlite, to have fast queries WHERE timestamp BETWEEN a AND b?

推荐答案

第一个解决方案

问题中详细介绍的方法(2)似乎运行良好.在基准测试中,我获得了:

First solution

The method (2) detailed in the question seems to work well. In a benchmark, I obtained:

  • 天真的方法,不带索引:18 MB数据库,86毫秒查询时间
  • 天真方法,索引:32 MB数据库,12毫秒查询时间
  • 方法(2): 18 MB数据库,12 ms查询时间

关键点是使用 dt 作为 INTEGER PRIMARY KEY ,因此它将是行ID本身(另请参见是SQLite?),使用B树,并且不会是另一个隐藏的 rowid 列.因此,我们避免了额外的索引,该索引将使对应性 dt =>.rowid :此处 dt 行ID.

The key point is here to use dt as an INTEGER PRIMARY KEY, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), using a B-tree, and there will not be another hidden rowid column. Thus we avoid an extra index which would make a correspondance dt => rowid: here dt is the row id.

我们还使用 AUTOINCREMENT 在内部创建一个 sqlite_sequence 表,该表跟踪最后添加的ID.这在插入时很有用:由于两个事件可能具有相同的时间戳(以秒为单位)(即使使用毫秒或微秒的时间戳也可能,操作系统可能会截断精度),因此我们使用 timestamp * 10000之间的最大值 last_added_ID + 1 以确保其唯一性:

We also use AUTOINCREMENT which internally creates a sqlite_sequence table, which keeps track of the last added ID. This is useful when inserting: since it is possible that two events have the same timestamp in seconds (it would be possible even with milliseconds or microseconds timestamps, the OS could truncate the precision), we use the maximum between timestamp*10000 and last_added_ID + 1 to make sure it's unique:

 MAX(?, (SELECT seq FROM sqlite_sequence) + 1)

代码:

import sqlite3, random, time
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases of 1 second with probability 1%
        t += 1
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000, ))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)


使用 WITHOUT ROWID

这是使用 WITHOUT ROWID 的另一种方法,该方法会给出 8 ms 查询时间.我们必须自己实现一个自动递增的ID,因为使用 WITHOUT ROWID 时AUTOINCREMENT不可用.
当我们想使用 PRIMARY KEY(dt,another_column1,another_column2,id)并避免有多余的 rowid 列时, WITHOUT ROWID 很有用..而不是为 rowid 提供一棵B树,为(dt,another_column1,...)提供一棵B树,我们只有一个.


Using a WITHOUT ROWID table

Here is another method with WITHOUT ROWID which gives a 8 ms query time. We have to implement an auto-incrementing id ourself, since AUTOINCREMENT is not available when using WITHOUT ROWID.
WITHOUT ROWID is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id) and avoid to have an extra rowid column. Instead of having one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.

db.executescript("""
    CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);

    CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
    
    CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
    """)

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0: # timestamp increases of 1 second with probabibly 1%
        t += 1
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)


大致排序的UUID

通常,该问题与具有大致排序"的ID的ID有关.按日期时间.有关此的更多信息:


Roughly-sorted UUID

More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:

  • ULID (Universally Unique Lexicographically Sortable Identifier)
  • Snowflake
  • MongoDB ObjectId

所有这些方法使用的ID为:

All these methods use an ID which is:

[---- timestamp ----][---- random and/or incremental ----]

这篇关于如何在快速时间范围查询中将时间序列与Sqlite一起使用?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆