查找列值连续增加的行 [英] Finding rows with consecutive increase in the values of a column

查看:33
本文介绍了查找列值连续增加的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 sql 表,用于存储股票的每日价格.每天收盘后都会插入新记录.我想找到价格连续上涨的股票.

I have a sql table that stores the daily prices of stocks. New records are inserted every day after the market closes. I want to find the stocks that have consecutive increases in price.

该表有很多列,但这是相关的子集:

The table has lots of columns, but this is the relevant subset:

quoteid     stockid      closeprice     createdate
--------------------------------------------------
    1           1               1       01/01/2012
    2           2              10       01/01/2012
    3           3              15       01/01/2012

    4           1               2       01/02/2012
    5           2              11       01/02/2012
    6           3              13       01/02/2012

    7           1               5       01/03/2012
    8           2              13       01/03/2012
    9           3              17       01/03/2012

   10           1               7       01/04/2012
   11           2              14       01/04/2012
   12           3              18       01/04/2012

   13           1               9       01/05/2012
   14           2              11       01/05/2012
   15           3              10       01/05/2012

quoteid 列是主键.

在表中,股票编号 1 的收盘价每天都在上涨.股票编号3波动较大,股票编号2的价格在最后一天下跌.

In the table, the closing price of stock id 1 increases every day. Stock id 3 fluctuates a lot, and the price for stock id 2 fell on the last day.

我正在寻找这样的结果:

I am looking for a result like this:

stockid     Consecutive Count (CC)
----------------------------------
    1                5
    2                4

如果你能得到带有连续连续记录日期的输出,那就更好了:

If you can get output with dates for the consecutive streak, that would even better:

stockid     Consecutive Count (CC)      StartDate      EndDate
---------------------------------------------------------------
    1                5                 01/01/2012    01/05/2012
    2                4                 01/01/2012    01/04/2012

StartDate 是价格开始上涨的时间,EndDate 是牛市实际结束的时间.

StartDate is when the price started increasing and EndDate is when the bull run actually finished.

我发现这不是一个简单的问题.我看过这里的其他帖子,它们也处理这种连续的场景,但它们不符合我的需求.如果你知道任何与我相似的帖子,请告诉我.

I have figured this is not an easy problem. I have looked at other posts here which also deal with this consecutive scenario but they don't fit my needs. If you know any post that is similar to mine, please do let me know.

推荐答案

在任何情况下,把它放在增加每股票行数方面是有帮助的(实际的 quoteid 值不是在这里真的很有帮助).捕获的天数(在这个表中)是最简单的——如果你想要其他的东西(比如只有工作日,忽略周末/假期,或者其他什么),它会变得更复杂;您可能需要一个日历文件.如果您还没有 [stockid, createdate] 上的索引,您会想要一个索引.

In any case, it helps to put it in terms of increasing rows-per-stock (the actual quoteid value isn't really helpful here). Count of days captured (in this table) is easiest - if you want something else (like only business days, ignoring weekends/holidays, or whatever) it gets more involved; you'd probably need a calendar file. You're going to want an index over [stockid, createdate], if you don't have one already.

WITH StockRow AS (SELECT stockId, closePrice, createdDate,
                         ROW_NUMBER() OVER(PARTITION BY stockId 
                                           ORDER BY createdDate) rn
                  FROM Quote),

     RunGroup AS (SELECT Base.stockId, Base.createdDate,
                         MAX(Restart.rn) OVER(PARTITION BY Base.stockId
                                              ORDER BY Base.createdDate) groupingId
                  FROM StockRow Base
                  LEFT JOIN StockRow Restart
                         ON Restart.stockId = Base.stockId
                            AND Restart.rn = Base.rn - 1
                            AND Restart.closePrice > Base.closePrice)

SELECT stockId, 
       COUNT(*) AS consecutiveCount, 
       MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

根据提供的数据产生以下结果:

Which yields the following results from the provided data:

Increasing_Run
stockId   consecutiveCount  startDate    endDate
===================================================
1         5                 2012-01-01   2012-01-05
2         4                 2012-01-01   2012-01-04
3         3                 2012-01-02   2012-01-04

SQL 小提琴示例
(Fiddle 也有多次运行的例子)

SQL Fiddle Example
(Fiddle also has an example for multiple runs)

此分析将忽略所有间隙,正确匹配所有运行(下一次正运行开始时).

This analysis will ignore all gaps, correctly matches all runs (the next time a positive run starts).

这里发生了什么?

StockRow AS (SELECT stockId, closePrice, createdDate,
                    ROW_NUMBER() OVER(PARTITION BY stockId 
                                      ORDER BY createdDate) rn
             FROM Quote)

这个 CTE 被用于一个目的:我们需要一种方法来找到下一行/上一行,所以首先我们按(日期)顺序对每一行进行编号......

This CTE is being used for one purpose: we need a way to find the next/previous row, so first we number each row in order (of the date)...

RunGroup AS (SELECT Base.stockId, Base.createdDate,
                    MAX(Restart.rn) OVER(PARTITION BY Base.stockId
                                         ORDER BY Base.createdDate) groupingId
             FROM StockRow Base
             LEFT JOIN StockRow Restart
                    ON Restart.stockId = Base.stockId
                       AND Restart.rn = Base.rn - 1
                           AND Restart.closePrice > Base.closePrice)

... 然后根据索引加入它们.如果你最终得到了具有 LAG()/LEAD() 的东西,那么使用它们几乎肯定是一个更好的选择.不过,这里有一件关键的事情 - 仅当该行乱序(小于前一行)时才匹配.否则,该值最终为 null(使用 LAG(),您需要在之后使用类似 CASE 的东西来解决这个问题).你会得到一个看起来像这样的临时集合:

... and then join them based on the index. If you end up on something that has LAG()/LEAD(), using those instead will almost certainly be a better option. There's one critical thing here though - matches are only if the row is out-of-sequence (less than the previous row). Otherwise, the value end up being null (with LAG(), you'd need to use something like CASE afterwards to pull this off). You get a temporary set that looks something like this:

B.rn   B.closePrice   B.createdDate  R.rn   R.closePrice   R.createdDate  groupingId
1      15             2012-01-01     -      -              -              -
2      13             2012-01-02     1      15             2012-01-01     1
3      17             2012-01-03     -      -              -              1
4      18             2012-01-04     -      -              -              1
5      10             2012-01-05     4      18             2012-01-04     4

... 所以只有当前一行大于当前"行时,才会有 Restart 的值.窗口函数中 MAX() 的使用被用于迄今为止看到的最大值......因为 null 是最低的,导致行索引为保留所有其他行,直到发生另一个不匹配(给出一个新值).在这一点上,我们基本上有了 查询,准备进行最终聚合.

... So there's values for Restart only when the previous was greater than the "current" row. The use of MAX() in the window function is being used to the greatest value seen so far... which because null is lowest, causes the row-index to be retained for all other rows until another mismatch occurs (which gives a new value). At this point, we essentially have the intermediate results of a gaps-and-islands query, ready for the final aggregation.

SELECT stockId, 
       COUNT(*) AS consecutiveCount, 
       MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

查询的最后一部分是获取运行的开始和结束日期,并计算这些日期之间的条目数.如果日期计算有更复杂的事情,则可能需要在此时进行.GROUP BY 显示了少数几个合法的 not 实例之一,包括 SELECT 子句中的列.HAVING 子句用于消除太短"的运行.

The final part of the query is getting the start and end dates of the run, and counting the number of entries between those dates. If there was something more complicated for the date calculation, it probably needs to happen at this point. The GROUP BY is showing one of the few legitimate instances of not including a column in the SELECT clause. The HAVING clause is used to eliminate runs that are "too short".

这篇关于查找列值连续增加的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆