查找列值连续增加的行 [英] Finding rows with consecutive increase in the values of a column
问题描述
我有一个 sql 表,用于存储股票的每日价格.每天收盘后都会插入新记录.我想找到价格连续上涨的股票.
I have a sql table that stores the daily prices of stocks. New records are inserted every day after the market closes. I want to find the stocks that have consecutive increases in price.
该表有很多列,但这是相关的子集:
The table has lots of columns, but this is the relevant subset:
quoteid stockid closeprice createdate
--------------------------------------------------
1 1 1 01/01/2012
2 2 10 01/01/2012
3 3 15 01/01/2012
4 1 2 01/02/2012
5 2 11 01/02/2012
6 3 13 01/02/2012
7 1 5 01/03/2012
8 2 13 01/03/2012
9 3 17 01/03/2012
10 1 7 01/04/2012
11 2 14 01/04/2012
12 3 18 01/04/2012
13 1 9 01/05/2012
14 2 11 01/05/2012
15 3 10 01/05/2012
quoteid
列是主键.
在表中,股票编号 1 的收盘价每天都在上涨.股票编号3波动较大,股票编号2的价格在最后一天下跌.
In the table, the closing price of stock id 1 increases every day. Stock id 3 fluctuates a lot, and the price for stock id 2 fell on the last day.
我正在寻找这样的结果:
I am looking for a result like this:
stockid Consecutive Count (CC)
----------------------------------
1 5
2 4
如果你能得到带有连续连续记录日期的输出,那就更好了:
If you can get output with dates for the consecutive streak, that would even better:
stockid Consecutive Count (CC) StartDate EndDate
---------------------------------------------------------------
1 5 01/01/2012 01/05/2012
2 4 01/01/2012 01/04/2012
StartDate
是价格开始上涨的时间,EndDate
是牛市实际结束的时间.
StartDate
is when the price started increasing and EndDate
is when the bull run actually finished.
我发现这不是一个简单的问题.我看过这里的其他帖子,它们也处理这种连续的场景,但它们不符合我的需求.如果你知道任何与我相似的帖子,请告诉我.
I have figured this is not an easy problem. I have looked at other posts here which also deal with this consecutive scenario but they don't fit my needs. If you know any post that is similar to mine, please do let me know.
推荐答案
在任何情况下,把它放在增加每股票行数方面是有帮助的(实际的 quoteid
值不是在这里真的很有帮助).捕获的天数(在这个表中)是最简单的——如果你想要其他的东西(比如只有工作日,忽略周末/假期,或者其他什么),它会变得更复杂;您可能需要一个日历文件.如果您还没有 [stockid
, createdate
] 上的索引,您会想要一个索引.
In any case, it helps to put it in terms of increasing rows-per-stock (the actual quoteid
value isn't really helpful here). Count of days captured (in this table) is easiest - if you want something else (like only business days, ignoring weekends/holidays, or whatever) it gets more involved; you'd probably need a calendar file. You're going to want an index over [stockid
, createdate
], if you don't have one already.
WITH StockRow AS (SELECT stockId, closePrice, createdDate,
ROW_NUMBER() OVER(PARTITION BY stockId
ORDER BY createdDate) rn
FROM Quote),
RunGroup AS (SELECT Base.stockId, Base.createdDate,
MAX(Restart.rn) OVER(PARTITION BY Base.stockId
ORDER BY Base.createdDate) groupingId
FROM StockRow Base
LEFT JOIN StockRow Restart
ON Restart.stockId = Base.stockId
AND Restart.rn = Base.rn - 1
AND Restart.closePrice > Base.closePrice)
SELECT stockId,
COUNT(*) AS consecutiveCount,
MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate
根据提供的数据产生以下结果:
Which yields the following results from the provided data:
Increasing_Run
stockId consecutiveCount startDate endDate
===================================================
1 5 2012-01-01 2012-01-05
2 4 2012-01-01 2012-01-04
3 3 2012-01-02 2012-01-04
SQL 小提琴示例
(Fiddle 也有多次运行的例子)
SQL Fiddle Example
(Fiddle also has an example for multiple runs)
此分析将忽略所有间隙,正确匹配所有运行(下一次正运行开始时).
This analysis will ignore all gaps, correctly matches all runs (the next time a positive run starts).
这里发生了什么?
StockRow AS (SELECT stockId, closePrice, createdDate,
ROW_NUMBER() OVER(PARTITION BY stockId
ORDER BY createdDate) rn
FROM Quote)
这个 CTE 被用于一个目的:我们需要一种方法来找到下一行/上一行,所以首先我们按(日期)顺序对每一行进行编号......
This CTE is being used for one purpose: we need a way to find the next/previous row, so first we number each row in order (of the date)...
RunGroup AS (SELECT Base.stockId, Base.createdDate,
MAX(Restart.rn) OVER(PARTITION BY Base.stockId
ORDER BY Base.createdDate) groupingId
FROM StockRow Base
LEFT JOIN StockRow Restart
ON Restart.stockId = Base.stockId
AND Restart.rn = Base.rn - 1
AND Restart.closePrice > Base.closePrice)
... 然后根据索引加入它们.如果你最终得到了具有 LAG()
/LEAD()
的东西,那么使用它们几乎肯定是一个更好的选择.不过,这里有一件关键的事情 - 仅当该行乱序(小于前一行)时才匹配.否则,该值最终为 null
(使用 LAG()
,您需要在之后使用类似 CASE
的东西来解决这个问题).你会得到一个看起来像这样的临时集合:
... and then join them based on the index. If you end up on something that has LAG()
/LEAD()
, using those instead will almost certainly be a better option. There's one critical thing here though - matches are only if the row is out-of-sequence (less than the previous row). Otherwise, the value end up being null
(with LAG()
, you'd need to use something like CASE
afterwards to pull this off). You get a temporary set that looks something like this:
B.rn B.closePrice B.createdDate R.rn R.closePrice R.createdDate groupingId
1 15 2012-01-01 - - - -
2 13 2012-01-02 1 15 2012-01-01 1
3 17 2012-01-03 - - - 1
4 18 2012-01-04 - - - 1
5 10 2012-01-05 4 18 2012-01-04 4
... 所以只有当前一行大于当前"行时,才会有 Restart
的值.窗口函数中 MAX()
的使用被用于迄今为止看到的最大值......因为 null
是最低的,导致行索引为保留所有其他行,直到发生另一个不匹配(给出一个新值).在这一点上,我们基本上有了 gaps-and-islands 查询,准备进行最终聚合.
... So there's values for Restart
only when the previous was greater than the "current" row. The use of MAX()
in the window function is being used to the greatest value seen so far... which because null
is lowest, causes the row-index to be retained for all other rows until another mismatch occurs (which gives a new value). At this point, we essentially have the intermediate results of a gaps-and-islands query, ready for the final aggregation.
SELECT stockId,
COUNT(*) AS consecutiveCount,
MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate
查询的最后一部分是获取运行的开始和结束日期,并计算这些日期之间的条目数.如果日期计算有更复杂的事情,则可能需要在此时进行.GROUP BY
显示了少数几个合法的 not 实例之一,包括 SELECT
子句中的列.HAVING
子句用于消除太短"的运行.
The final part of the query is getting the start and end dates of the run, and counting the number of entries between those dates. If there was something more complicated for the date calculation, it probably needs to happen at this point. The GROUP BY
is showing one of the few legitimate instances of not including a column in the SELECT
clause. The HAVING
clause is used to eliminate runs that are "too short".
这篇关于查找列值连续增加的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!