SQLite query (and subqueries) for lagging a column by group

Views: 27 Published: 2021/8/23 19:51:23 Tags: sql r sqlite

Problem description

I have a SQLite table myTable with some 3.2 billion records. It holds records of workers' status (0 or 1) in companies over time. The same worker can be in different companies. The table looks like this:

companyID   workerID    timeVar     workerStatus
-------------------------------------
1           1           1           0
1           1           2           0
1           1           3           0
1           1           4           1
-------------------------------------
1           2           3           1
1           2           4           1
1           2           5           0
1           2           6           0
-------------------------------------
2           1           1           1
2           1           2           1
2           1           3           1
2           1           4           0
-------------------------------------
2           2           1           1
2           2           2           1
2           2           3           0
2           2           4           0
2           2           5           0
2           2           6           1
-------------------------------------

Note that companyID, workerID, and timeVar are the keys.

I want to create two columns:

workerStatusMaxLag1 will say whether the worker's workerStatus (in any company) was equal to 1 in the immediately preceding timeVar.

workerStatusMaxLag2 will say whether the worker's workerStatus (in any company) was equal to 1 in either of the two immediately preceding timeVar values.

To be clear, this is what I am trying to achieve:

companyID   workerID    timeVar     workerStatus    workerStatusMaxLag1     workerStatusMaxLag2
1           1           1           0               NULL                    NULL
1           1           2           0               1                       1
1           1           3           0               1                       1
1           1           4           1               1                       1
-----------------------------------------------------------------------------
1           2           3           1               1                       1
1           2           4           1               1                       1
1           2           5           0               1                       1
1           2           6           0               0                       1
-----------------------------------------------------------------------------
2           1           1           1               NULL                    NULL
2           1           2           1               1                       1
2           1           3           1               1                       1
2           1           4           0               1                       1
-----------------------------------------------------------------------------
2           2           1           1               NULL                    NULL
2           2           2           1               1                       1   
2           2           3           0               1                       1
2           2           4           0               1                       1
2           2           5           0               1                       1
2           2           6           1               0                       1
-----------------------------------------------------------------------------

Note

Workers can move across companies, and the two new columns in this table vary only with workerID and timeVar, not with companyID, except that the set of timeVar values can differ for each company x worker combination.

I am working in R, but given that my data is so large, I think I am safer doing all of this with SQL statements alone, without loading (slices of) the data into R.

I am new to SQL and have tried to take some baby steps, but have not gotten very far. I think the first step in building my query is:

-- Collapse to one row per worker and period (highest status across companies)
SELECT workerID, timeVar, MAX(workerStatus) AS maxStatus
FROM myTable
GROUP BY workerID, timeVar;

That way I know each worker's workerStatus in each period. Then I thought of joining this back to the table, but first I need to calculate the lags.
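The plan described above (collapse to one row per worker and period, compute the lags, join back) could be sketched roughly as follows. This is only one possible reading, not a confirmed answer: it assumes timeVar takes consecutive integer values, so that "the previous period" is simply timeVar - 1, and the names perWorker and maxStatus are made up for illustration. Python's sqlite3 module is used only as a harness to run the SQL on the sample rows; the SQL itself is what matters.

```python
import sqlite3

# Sample rows from the question: (companyID, workerID, timeVar, workerStatus)
ROWS = [
    (1, 1, 1, 0), (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 1),
    (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
    (2, 1, 1, 1), (2, 1, 2, 1), (2, 1, 3, 1), (2, 1, 4, 0),
    (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
    (2, 2, 5, 0), (2, 2, 6, 1),
]

# Assumption: timeVar values are consecutive integers, so the previous
# periods are timeVar - 1 and timeVar - 2. perWorker/maxStatus are
# illustrative names, not part of the original table.
LAG_QUERY = """
WITH perWorker AS (
    -- Step 1: one row per worker and period, highest status across companies
    SELECT workerID, timeVar, MAX(workerStatus) AS maxStatus
    FROM myTable
    GROUP BY workerID, timeVar
)
SELECT t.companyID, t.workerID, t.timeVar, t.workerStatus,
       l1.maxStatus AS workerStatusMaxLag1,
       -- Two-argument MAX is NULL if either side is NULL, so fall back
       -- to whichever of the two lags exists (NULL if neither does).
       COALESCE(MAX(l1.maxStatus, l2.maxStatus),
                l1.maxStatus, l2.maxStatus) AS workerStatusMaxLag2
FROM myTable AS t
-- Step 2: join the collapsed table back, shifted by one and two periods
LEFT JOIN perWorker AS l1
       ON l1.workerID = t.workerID AND l1.timeVar = t.timeVar - 1
LEFT JOIN perWorker AS l2
       ON l2.workerID = t.workerID AND l2.timeVar = t.timeVar - 2
ORDER BY t.companyID, t.workerID, t.timeVar
"""


def lagged_rows(rows):
    """Load rows into an in-memory myTable and return the lagged result."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE myTable ("
            " companyID INTEGER, workerID INTEGER, timeVar INTEGER,"
            " workerStatus INTEGER,"
            " PRIMARY KEY (companyID, workerID, timeVar))"
        )
        con.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", rows)
        return con.execute(LAG_QUERY).fetchall()
    finally:
        con.close()


result = lagged_rows(ROWS)
for row in result:
    print(row)
```

On the sample data this gives NULL in both new columns for each worker's first period and a 0 in workerStatusMaxLag1 only for worker 2 at timeVar 6, which agrees with the desired table above. If timeVar can have gaps, the timeVar - 1 and timeVar - 2 join conditions would need to be replaced by something that finds the nearest earlier period.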

I have looked around but did not get very far. Also, there might be different approaches, and performance may vary considerably. While I don't need top performance (I will do this one time only), I do have a pretty heavy table (3.2 billion rows, or 100GB+ worth of data), and I just don't want my machine to crash (I'm on Windows 7, 16GB RAM, Intel i7-2600).
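Given the concern about not pulling data into the client, one option that might be worth trying is to stage the intermediate result in an on-disk table and write the final result into another table, so that no rows are ever returned to R. The sketch below is untested at the 3.2 billion row scale, and nothing here is a measured performance claim. The table names workerPeriod and myTableLagged, the column maxStatus, and the file name workers.sqlite are hypothetical, and timeVar is again assumed to be consecutive integers. Python is used only to run the SQL against a small temporary database file.

```python
import os
import sqlite3
import tempfile

# Worker 2's rows from the question: (companyID, workerID, timeVar, workerStatus)
ROWS = [
    (1, 2, 3, 1), (1, 2, 4, 1), (1, 2, 5, 0), (1, 2, 6, 0),
    (2, 2, 1, 1), (2, 2, 2, 1), (2, 2, 3, 0), (2, 2, 4, 0),
    (2, 2, 5, 0), (2, 2, 6, 1),
]

# Hypothetical staging script; assumes consecutive integer timeVar values.
STAGING_SQL = """
-- One row per worker and period. The primary key gives the joins below
-- an index on (workerID, timeVar) to look rows up by.
CREATE TABLE workerPeriod (
    workerID  INTEGER NOT NULL,
    timeVar   INTEGER NOT NULL,
    maxStatus INTEGER NOT NULL,
    PRIMARY KEY (workerID, timeVar)
);
INSERT INTO workerPeriod
SELECT workerID, timeVar, MAX(workerStatus)
FROM myTable
GROUP BY workerID, timeVar;

-- Write the lagged columns into a new table instead of returning rows.
CREATE TABLE myTableLagged AS
SELECT t.companyID    AS companyID,
       t.workerID     AS workerID,
       t.timeVar      AS timeVar,
       t.workerStatus AS workerStatus,
       l1.maxStatus   AS workerStatusMaxLag1,
       COALESCE(MAX(l1.maxStatus, l2.maxStatus),
                l1.maxStatus, l2.maxStatus) AS workerStatusMaxLag2
FROM myTable AS t
LEFT JOIN workerPeriod AS l1
       ON l1.workerID = t.workerID AND l1.timeVar = t.timeVar - 1
LEFT JOIN workerPeriod AS l2
       ON l2.workerID = t.workerID AND l2.timeVar = t.timeVar - 2;
"""


def build_lagged_table(db_path, rows):
    """Create a demo myTable, run the staging script, return the new table."""
    con = sqlite3.connect(db_path)
    try:
        # In real use myTable already exists; it is created here for the demo.
        con.execute(
            "CREATE TABLE myTable (companyID INTEGER, workerID INTEGER,"
            " timeVar INTEGER, workerStatus INTEGER)"
        )
        con.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", rows)
        con.commit()
        con.executescript(STAGING_SQL)
        con.commit()
        return con.execute(
            "SELECT * FROM myTableLagged"
            " ORDER BY companyID, workerID, timeVar"
        ).fetchall()
    finally:
        con.close()


with tempfile.TemporaryDirectory() as tmp:
    lagged = build_lagged_table(os.path.join(tmp, "workers.sqlite"), ROWS)

for row in lagged:
    print(row)
```

From R, the same SQL script could presumably be sent to the database as plain statements, with only a small check query fetched afterwards.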
