SQLite query (and subqueries) for lagging a column by group

Problem description

I have a SQLite table myTable with some 3.2 billion records. It has records of workers' status (0 or 1) in companies over time. The same worker can be in different companies. The table looks like this:

companyID   workerID    timeVar     workerStatus
-------------------------------------
1           1           1           0
1           1           2           0
1           1           3           0
1           1           4           1
-------------------------------------
1           2           3           1
1           2           4           1
1           2           5           0
1           2           6           0
-------------------------------------
2           1           1           1
2           1           2           1
2           1           3           1
2           1           4           0
-------------------------------------
2           2           1           1
2           2           2           1
2           2           3           0
2           2           4           0
2           2           5           0
2           2           6           1
-------------------------------------

Note that companyID, workerID, and timeVar together form the key.

I would like to create two columns:

  1. workerStatusMaxLag1 will say whether workerStatus in the immediately previous timeVar was equal to 1.

  2. workerStatusMaxLag2 will say whether workerStatus in either of the two immediately previous timeVar values was equal to 1.

To be clear, this is what I am trying to achieve:

companyID   workerID    timeVar     workerStatus    workerStatusMaxLag1     workerStatusMaxLag2
1           1           1           0               NULL                    NULL
1           1           2           0               1                       1
1           1           3           0               1                       1
1           1           4           1               1                       1
-----------------------------------------------------------------------------
1           2           3           1               1                       1
1           2           4           1               1                       1
1           2           5           0               1                       1
1           2           6           0               0                       1
-----------------------------------------------------------------------------
2           1           1           1               NULL                    NULL
2           1           2           1               1                       1
2           1           3           1               1                       1
2           1           4           0               1                       1
-----------------------------------------------------------------------------
2           2           1           1               NULL                    NULL
2           2           2           1               1                       1   
2           2           3           0               1                       1
2           2           4           0               1                       1
2           2           5           0               1                       1
2           2           6           1               0                       1
-----------------------------------------------------------------------------    

Note

Workers can move across companies, and the two new columns in this table vary only within workerID and timeVar, not across companyID, except that the set of timeVar values can differ for each company x worker combination.

I am working in R, but given that my data is so large, I think I am safer if I do all of this with SQL statements alone, and not load (slices of) the data into R.

I am new to SQL and I have tried to take some baby steps, but have not gotten very far. I think the first step in building my query is to

SELECT companyID, workerID, timeVar, MAX(workerStatus)
FROM myTable
GROUP BY workerID, timeVAR

So that I know workerStatus in each period. Then I thought of joining this back to the table, but first I need to calculate the lags.

I have looked but didn't go very far. Also, there might be different approaches and performance may vary considerably. While I don't need top performance (I will do this one time only), I do have a pretty heavy table (3.2 billion rows, or a 100GB+ table worth of data), and I just don't want my machine to crash (I'm on Windows 7, 16GB RAM, Intel i7-2600).

Recommended answer

You can express the query using correlated subqueries. For instance, to get the previous status:

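-- For each row, fetch workerStatus from the latest earlier timeVar
-- of the same company and worker (NULL when there is no earlier row).
select t.*,
       (select t2.workerStatus
        from myTable t2
        where t2.companyID = t.companyID and
              t2.workerID = t.workerID and
              t2.timeVar < t.timeVar
        order by t2.timeVar desc
        limit 1
       ) as prevWorkerStatus  -- illustrative alias
from myTable t;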

This query really wants an index: myTable(companyID, workerID, timeVar, workerStatus).
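
As a rough check on a small scale, the index and the correlated subquery can be tried against the sample rows from the question. The sketch below uses Python's standard sqlite3 module purely as a harness (the same SQL can be sent from R). The index name idx_myTable_lag and the alias prevWorkerStatus are made up for illustration, and the column is taken to be workerStatus as in the question's table. It says nothing about how this behaves at 3.2 billion rows.

```python
import sqlite3

# Sample rows from the question: (companyID, workerID, timeVar, workerStatus)
ROWS = [
    (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
    (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
    (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
    (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
    (2, 2, 5, 0), (2, 2, 6, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    "companyID INTEGER, workerID INTEGER, timeVar INTEGER, workerStatus INTEGER)"
)
conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", ROWS)

# Covering index suggested in the answer (the index name is hypothetical).
conn.execute(
    "CREATE INDEX idx_myTable_lag "
    "ON myTable (companyID, workerID, timeVar, workerStatus)"
)

# Previous status within the same company and worker.
PREV_STATUS_SQL = """
SELECT t.companyID, t.workerID, t.timeVar, t.workerStatus,
       (SELECT t2.workerStatus
          FROM myTable t2
         WHERE t2.companyID = t.companyID
           AND t2.workerID  = t.workerID
           AND t2.timeVar   < t.timeVar
         ORDER BY t2.timeVar DESC
         LIMIT 1) AS prevWorkerStatus
  FROM myTable t
 ORDER BY t.companyID, t.workerID, t.timeVar
"""

prev_status = conn.execute(PREV_STATUS_SQL).fetchall()
for row in prev_status:
    print(row)
```

Note that this lag is computed within each company x worker group, so the first row of every group gets NULL.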

Even with the index, though, 3.2 billion rows is pretty big for such a complicated query. I don't know if SQLite will complete.
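
The two columns the question asks for are defined per worker across companies, so one possible way to adapt the idea is to first collapse the table to one row per worker and period (the GROUP BY step from the question) and then look up earlier periods in that smaller table. The sketch below is only that, a sketch: it assumes "previous timeVar" means timeVar - 1 (consecutive integer periods), the names workerPeriod, maxStatus and idx_workerPeriod are hypothetical, and Python's standard sqlite3 module is used only to run the SQL on the sample rows. Whether it completes in reasonable time on the full table is untested.

```python
import sqlite3

# Sample rows from the question: (companyID, workerID, timeVar, workerStatus)
ROWS = [
    (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
    (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
    (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
    (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
    (2, 2, 5, 0), (2, 2, 6, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    "companyID INTEGER, workerID INTEGER, timeVar INTEGER, workerStatus INTEGER)"
)
conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", ROWS)

# Step 1: one row per worker and period, holding the max status across companies.
conn.executescript("""
CREATE TABLE workerPeriod AS
SELECT workerID, timeVar, MAX(workerStatus) AS maxStatus
  FROM myTable
 GROUP BY workerID, timeVar;

CREATE INDEX idx_workerPeriod ON workerPeriod (workerID, timeVar, maxStatus);
""")

# Step 2: look up the previous one and two periods for each original row.
# Assumption: the previous period is timeVar - 1 (consecutive integers).
MAX_LAG_SQL = """
SELECT t.companyID, t.workerID, t.timeVar, t.workerStatus,
       (SELECT p.maxStatus
          FROM workerPeriod p
         WHERE p.workerID = t.workerID
           AND p.timeVar  = t.timeVar - 1) AS workerStatusMaxLag1,
       (SELECT MAX(p.maxStatus)
          FROM workerPeriod p
         WHERE p.workerID = t.workerID
           AND p.timeVar BETWEEN t.timeVar - 2 AND t.timeVar - 1) AS workerStatusMaxLag2
  FROM myTable t
 ORDER BY t.companyID, t.workerID, t.timeVar
"""

max_lags = conn.execute(MAX_LAG_SQL).fetchall()
for row in max_lags:
    print(row)
```

Both subqueries return NULL when a worker has no earlier period, which matches the NULL rows in the desired output shown in the question.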
