SQLite query (and subqueries) for lagging a column by group
Problem description
I have a SQLite table myTable with some 3.2 billion records. It holds records of workers' status (0 or 1) in companies over time. The same worker can be in different companies. The table looks like this:
companyID workerID timeVar workerStatus
-------------------------------------
1 1 1 0
1 1 2 0
1 1 3 0
1 1 4 1
-------------------------------------
1 2 3 1
1 2 4 1
1 2 5 0
1 2 6 0
-------------------------------------
2 1 1 1
2 1 2 1
2 1 3 1
2 1 4 0
-------------------------------------
2 2 1 1
2 2 2 1
2 2 3 0
2 2 4 0
2 2 5 0
2 2 6 1
-------------------------------------
Note that companyID, workerID, and timeVar are the keys.
I want to create two columns:
workerStatusMaxLag1 will say whether workerStatus in the immediately preceding timeVar was equal to 1.
workerStatusMaxLag2 will say whether workerStatus in either of the two immediately preceding timeVar values was equal to 1.
To be clear, this is what I am trying to achieve:
companyID workerID timeVar workerStatus workerStatusMaxLag1 workerStatusMaxLag2
1 1 1 0 NULL NULL
1 1 2 0 1 1
1 1 3 0 1 1
1 1 4 1 1 1
-----------------------------------------------------------------------------
1 2 3 1 1 1
1 2 4 1 1 1
1 2 5 0 1 1
1 2 6 0 0 1
-----------------------------------------------------------------------------
2 1 1 1 NULL NULL
2 1 2 1 1 1
2 1 3 1 1 1
2 1 4 0 1 1
-----------------------------------------------------------------------------
2 2 1 1 NULL NULL
2 2 2 1 1 1
2 2 3 0 1 1
2 2 4 0 1 1
2 2 5 0 1 1
2 2 6 1 0 1
-----------------------------------------------------------------------------
Note: workers can move across companies, and the two new columns in this table vary only by workerID and timeVar, not by companyID, except that the set of timeVar values can differ for each company x worker combination.
I am working in R, but given that my data is so large, I think I am safer doing all of this with SQL statements alone rather than loading (slices of) the data into R.
I am new to SQL and have tried to take some baby steps, but have not gotten very far. I think the first step in building my query is:
SELECT companyID, workerID, timeVar, MAX(workerStatus)
FROM myTable
GROUP BY workerID, timeVAR
That way I know the maximum workerStatus for each worker in each period. Then I thought of joining this back to the table, but first I need to calculate the lags.
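The aggregation step described above can be sketched as follows. This is only an illustration on the sample rows from the question, run through Python's built-in sqlite3 module against an in-memory database; the table name workerPeriod and the column name statusMax are hypothetical. companyID is left out of the SELECT because it is not part of the GROUP BY.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    " companyID INTEGER, workerID INTEGER, timeVar INTEGER, workerStatus INTEGER,"
    " PRIMARY KEY (companyID, workerID, timeVar))"
)
rows = [
    (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
    (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
    (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
    (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
    (2, 2, 5, 0), (2, 2, 6, 1),
]
conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", rows)

# One row per (workerID, timeVar): the maximum status across all companies.
# workerPeriod and statusMax are hypothetical names chosen for this sketch.
conn.execute("""
    CREATE TABLE workerPeriod AS
    SELECT workerID, timeVar, MAX(workerStatus) AS statusMax
    FROM myTable
    GROUP BY workerID, timeVar
""")

worker_period = conn.execute(
    "SELECT workerID, timeVar, statusMax FROM workerPeriod "
    "ORDER BY workerID, timeVar"
).fetchall()
for row in worker_period:
    print(row)
```

On the sample data this yields ten rows, one per worker and period, which is the table the lags would then be computed on.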
I have looked but did not get very far. Also, there may be different approaches, and performance may vary considerably between them. While I don't need top performance (I will only do this once), I do have a pretty heavy table (3.2 billion rows, or 100GB+ of data), and I just don't want my machine to crash (I'm on Windows 7, 16GB RAM, Intel i7-2600).
Recommended answer
You can express the query using correlated subqueries. For instance, to get the previous status:
select t.*,
       -- most recent earlier status for the same company and worker
       (select t2.workerStatus
        from myTable t2
        where t2.companyId = t.companyId and
              t2.workerId = t.workerId and
              t2.timeVar < t.timeVar
        order by t2.timeVar desc
        limit 1
       ) as prevStatus
from myTable t;
This query really wants an index: myTable(companyId, workerId, timeVar, workerStatus).
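As a minimal sketch of the index and the correlated subquery working together, the snippet below runs both on a small slice of the sample data (worker 1 in both companies) using Python's built-in sqlite3 module. The index name idx_myTable_lag and the alias prevStatus are hypothetical, and the status column is assumed to be workerStatus as in the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    " companyID INTEGER, workerID INTEGER, timeVar INTEGER, workerStatus INTEGER)"
)
# Worker 1 in companies 1 and 2, taken from the question's sample table.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
        (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
    ],
)

# Index covering every column the subquery touches (hypothetical name).
conn.execute(
    "CREATE INDEX idx_myTable_lag "
    "ON myTable (companyID, workerID, timeVar, workerStatus)"
)

# Previous status within the same company and worker.
prev_status = conn.execute("""
    SELECT t.companyID, t.workerID, t.timeVar, t.workerStatus,
           (SELECT t2.workerStatus
            FROM myTable t2
            WHERE t2.companyID = t.companyID
              AND t2.workerID = t.workerID
              AND t2.timeVar < t.timeVar
            ORDER BY t2.timeVar DESC
            LIMIT 1) AS prevStatus
    FROM myTable t
    ORDER BY t.companyID, t.workerID, t.timeVar
""").fetchall()
for row in prev_status:
    print(row)
```

Because the subquery also matches on companyID, the lag here is taken within each company. On this sample, the row for company 1, worker 1, timeVar 2 gets a previous status of 0, whereas the desired table in the question shows 1 because it takes the maximum across companies. Dropping the companyID condition, or lagging a per-worker aggregate, would be needed to reproduce the question's definition.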
Even with the index, though, 3.2 billion rows is a lot for such a complicated query. I don't know whether SQLite will complete it.
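A possible alternative, not part of the answer above, is to use window functions instead of a correlated subquery. The sketch below assumes SQLite 3.25 or later (the first release with window functions) and assumes "previous timeVar" means the previous period present for that worker rather than timeVar minus one. The CTE names workerPeriod and lagged and the column statusMax are hypothetical. Whether this finishes on 3.2 billion rows is untested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    " companyID INTEGER, workerID INTEGER, timeVar INTEGER, workerStatus INTEGER,"
    " PRIMARY KEY (companyID, workerID, timeVar))"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
        (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
        (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
        (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
        (2, 2, 5, 0), (2, 2, 6, 1),
    ],
)

result = conn.execute("""
    WITH workerPeriod AS (
        -- maximum status per worker and period, across companies
        SELECT workerID, timeVar, MAX(workerStatus) AS statusMax
        FROM myTable
        GROUP BY workerID, timeVar
    ),
    lagged AS (
        SELECT workerID, timeVar,
               -- status in the previous period (NULL for the first period)
               LAG(statusMax) OVER (
                   PARTITION BY workerID ORDER BY timeVar
               ) AS workerStatusMaxLag1,
               -- maximum over the two previous periods; an empty frame gives NULL
               MAX(statusMax) OVER (
                   PARTITION BY workerID ORDER BY timeVar
                   ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
               ) AS workerStatusMaxLag2
        FROM workerPeriod
    )
    SELECT t.companyID, t.workerID, t.timeVar, t.workerStatus,
           l.workerStatusMaxLag1, l.workerStatusMaxLag2
    FROM myTable t
    JOIN lagged l ON l.workerID = t.workerID AND l.timeVar = t.timeVar
    ORDER BY t.companyID, t.workerID, t.timeVar
""").fetchall()
for row in result:
    print(row)
```

On the sample rows this reproduces the desired table from the question, including the NULLs in each worker's first period. The windowed MAX is used for the second lag because it skips missing rows, so a worker with only one earlier period still gets a value.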