SQL查找表中首次出现的数据集 [英] SQL to find the first occurrence of sets of data in a table

查看:800
本文介绍了SQL查找表中首次出现的数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我是否有一张桌子:

CREATE TABLE T
(
    TableDTM  TIMESTAMP  NOT NULL,
    Code      INT        NOT NULL
);

然后插入一些行:

INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:00:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:10:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:20:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:40:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:50:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:00:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:10:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:20:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:40:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:50:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:00:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:10:00', 3);

所以我最终得到一个类似于以下内容的表:

So I end up with a table similar to:

2011-01-13 10:00:00, 5
2011-01-13 10:10:00, 5
2011-01-13 10:20:00, 5
2011-01-13 10:30:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:00:00, 1
2011-01-13 11:10:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:40:00, 5
2011-01-13 11:50:00, 3
2011-01-13 12:00:00, 3
2011-01-13 12:10:00, 3

我该如何选择每组相同数字的第一个日期,所以我得出以下结论:

How can I select the first date of each set of identical numbers, so I end up with this:

2011-01-13 10:00:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:50:00, 3

在一天中的大部分时间里,我一直在用子查询之类的东西,因为某种原因我似乎无法破解。我确定某个地方有一个简单的方法!

I've been messing about with sub queries and the like for most of the day and for some reason I can't seem to crack it. I'm sure there's a simple way somewhere!

我可能想从结果中排除0,但这并不重要。.

I would probably want to exclude the 0's from the results, but that's not important for now..

推荐答案

2011年1月15日修订



我确定

Revised 15 Jan 11

I'm sure there's a simple way somewhere!

是的。但是首先是两个问题。

Yes, there is. But first, two Issues.


  1. 该表不是关系数据库表。它没有唯一的密钥,这是RM和规范化所要求的(特别是每一行必须具有唯一的标识符;不一定是PK)。因此,SQL(一种用于在关系数据库表上操作的标准语言)无法对其执行基本操作。

  1. The table is not a Relational Database table. It does not have an unique key, which is demanded by the RM and Normalisation (specifically that each row must have an unique identifier; not necessarily a PK). Therefore SQL, a standard language, for operating on Relational Database tables, cannot perform basic operations on it.


  • 这是一个堆(数据结构)

  • 任何和所有使用SQL的操作都将非常缓慢,而且将是不正确的

  • 将ROWCOUNT设置为1,执行行处理,SQL将在Heap上正常运行

  • 您最好的选择是使用任何unix实用程序对其进行操作(awk,cut,chop) 。他们快得快。满足您的要求所需的awk脚本将花费3分钟来编写,并且将在几秒钟内运行数百万条记录(我上周写了几条)。


  • it is a Heap (data structure, inserted and deleted in chronological order), with records not rows.
  • any and all operations using SQL will be horribly slow, and will not be correct
  • SET ROWCOUNT to 1, perform row processing, and SQL will work on the Heap just fine
  • your best bet is use any unix utiliy to operate on it (awk, cut, chop). They are blindingly fast. The awk script required to answer your requirement would take 3 mins to write and it will run in seconds for millions of records (I wrote a few last week).
    .

所以问题确实是 SQL在非关系堆中查找数据集的第一个出现 >。

现在,如果您的问题是 SQL在关系表中查找数据集的第一个出现,那当然意味着有一些唯一的行标识符,这(a)在SQL中很容易,并且(b)在任何形式的SQL中都是快速的...

Now if your question was SQL to find the first occurrence of sets of data in a Relational table, implying of course some unique row identifier, that would be (a) easy in SQL, and (b) fast in any flavour of SQL ...


  • 除了Oracle,众所周知 处理子查询错误 (特别是Tony Andrews的评论,他是Oracle上的知名机构)。在这种情况下,请使用Materialized Views。


这个问题很笼统(无怨言)。但是,这些特定需求中的许多通常是在较大的上下文中应用的,并且上下文具有此处的规范所没有的要求。通常,需要一个简单的子查询(但在Oracle中,请使用实例化视图来避免该子查询)。子查询也取决于外部上下文,即外部查询。因此,较小的一般性问题的答案将不包含实际特定需求的答案。

The question is very generic (no complaint). But many of these specific needs are usually applied within a larger context, and the context has requirements which are absent from the specification here. Generally the need is for a simple Subquery (but in Oracle use a Materialised View to avoid the subquery). And the subquery, too, depends on the outer context, the outer query. Therefore the answer to the small generic question will not contain the answer to the actual specific need.






无论如何,我不想避免这个问题。我们为什么不使用真实的示例,而不是简单的通用示例?并在关系表中查找另一组数据中的一组数据的第一个或最后一次出现,或最小值或最大值

主要查询

让我们使用 ▶数据模型◀ 来自上一个问题。

Let's use the ▶Data Model◀ from your previous question.

报告自某个日期以来的所有警报,其持续时间的峰值不是确认的

Report all Alerts since a certain date, with the peak Value for the duration, that are not Acknowledged

由于您将对所有时间和历史记录要求使用完全相同的技术(具有不同的表名和列名),因此需要充分了解子查询的基本结构,并且

Since you will be using exactly the same technique (with different table and column names) for all your temporal and History requirements, you need to fully understand the basic construct of a Subquery, and its different applications.


请注意,不仅是带有关系标识符(复合键)的纯5NF数据库,您还具有完整的Tempo整个能力,并且在不破坏5NF(无更新异常)的情况下呈现了时间要求,这意味着得出了 ValidToDateTime 的期限和持续时间,并且在数据中没有重复。关键是,这使事情变得复杂,因此这不是子查询教程的最佳示例。

Note that you have, not only a pure 5NF Database, with Relational Identifiers (composite keys), you have full Temporal capability throughout, and the temporal requirement is rendered without breaking 5NF (No Update Anomalies), which means the ValidToDateTime for periods and durations is derived, and not duplicated in data. Point is, that complicates things, hence this is not the best example for a tutorial on Subqueries.




  • 请记住,SQL引擎是集合处理器,因此我们以面向集合的思维方式来解决问题


    • 不要使引擎呆滞以进行行处理; 非常慢

    • ,更重要的是不必要

      • Remember the SQL engine is a set-processor, so we approach the problem with a set-oriented mindset
        • do not dumb the engine down to row-processing; that is very slow
        • and more important, unnecessary

          • 如果无法在SQL中编码子查询,则将受到非常的限制;然后需要引入数据重复或将大型结果集用作实体化视图或临时表,或使用各种方式的附加数据和附加处理,这些操作非常慢,更不用说完全不必要

          • 如果您在不切换到行处理或内联视图或临时表,请寻求帮助,这是您在这里所做的。

          • if you cannot code subqueries in SQL, you will be very limited; and then need to introduce data duplication or use large result sets as Materialised Views or temporary tables or all manner of additional data and additional processing, which will be s.l.o.w to v.e.r.y s.l.o.w, not to mention completely unnecessary
          • if there is anything you cannot do in a truly Relational Database (and my Data Models always are) without switching to row-processing or inline views or temp tables, ask for help, which is what you have done here.

          首先使用最小联接构建外部查询,等等,基于您需要的结果集的结构,仅此而已。首先解析外部查询的结构非常重要;否则,您将来回尝试使子查询适合外部查询,反之亦然。

          First build the Outer query using minimum joins, etc, based on the structure of the result set that you need, and nothing more. It is very important that the structure of the outer query is resolved first; otherwise you will go back and forth trying to make the subquery fit the outer query, and vice versa.


          • 这恰好需要子查询作为好。因此,请暂时将其保留,稍后再取用。目前,外部查询会在特定日期后获取所有(不是未确认的)警报

          • That happens to require a Subquery as well. So leave that part out for now, and pick that up later. For now, the Outer query gets all (not un-acknowledged) Alerts after a certain date

          ▶SQL代码required 必须在第1页上(抱歉,SO编辑功能太糟糕了,它破坏了格式,并且代码已经被格式化了。)

          The ▶SQL code◀ required is on page 1 (sorry, the SO edit features are horrible, it destroys the formatting, and the code is already formatted).

          然后构建子查询以填充每个单元格。

          Then build the Subquery to fill each cell.

          子查询(1)派生 Alert.Value

          这是一个简单的派生数据,从中选择 Value 阅读会生成警报。这些表是相关的,基数为1 :: 1,因此它是PK上的直接连接。

          That is a simple derived datum, select the Value from the Reading that generated the Alert. The tables are related, the cardinality is 1::1, so it is a straight join on the PK.


          • 所需的子查询类型这是一个相关子查询,我们需要将外部查询中的表与(内部)子查询中的表相关联。为了做到这一点,


            • 我们需要在外部查询中使用该表的别名,以使其与子查询中的表相关。

            • 进行区分,我仅对此类必需的关联使用别名,对纯联接使用完全限定的名称

            • The type of Subquery required here is a Correlated Subquery, we need to correlate a table in the Outer query to a table in the (inner) Subquery.
              • in order to do that, we need an Alias for the table in the Outer query, to correlate it to a table in the Subquery.
              • to make the distinction, I have used aliases only for such required correlation, and fully qualified names for plain joins

              ▶SQL代码◀ 在第2页上。

              The ▶SQL code◀ required is on page 2.

              我有特意为您提供了外部查询中的联接与通过子查询获得数据的混合方式,以便您可以学习(也可以通过联接获得 Alert.Value ,但这会会变得更加繁琐)。

              I have purposely given you a mix of joins in the Outer Query vs obtaining data via Subquery, so that you can learn (you could alternately obtain Alert.Value via a join, but that would be even more cumbersome).

              我们需要的下一个子查询派生 Alert.PeakValue 。为此,我们需要确定警报的时间持续时间。我们已经开始了警报持续时间;我们需要确定持续时间的结尾,即下一个(临时) Reading.Value ,该值在范围内。这也需要一个子查询,我们最好先对其进行处理。

              The next Subquery we need derives Alert.PeakValue. For that we need to determine the Temporal Duration of the Alert. We have the beginning of the Alert Duration; we need to determine the end of the Duration, which is the next (temporally) Reading.Value that is within range. That requires a Subquery as well, which we better handle first.


              • 从内部开始,向外进行逻辑处理。好旧的BODMAS。

              子查询(2)派生 Alert.EndDtm

              稍微复杂一点的Suquery来选择第一个 Reading.ReadingDtm ,大于或等于 Alert.ReadingDtm ,其 Reading.Value 小于或等于其 Sensor.UpperLimit

              A slightly more complex Suquery to select the first Reading.ReadingDtm, that is greater than or equal to the Alert.ReadingDtm, that has a Reading.Value which is less than or equal to its Sensor.UpperLimit.

              处理5NF时间数据

              要处理5NF数据库中的时间要求(其中 EndDateTime 存储,重复数据也存储在其中),我们要进行工作仅在 StartDateTime 上,并且 EndDateTime 派生的:它是下一个 StartDateTime 。这是 Duration 的时间概念。

              For handling temporal requirements in a 5NF Database (in which EndDateTime is not stored, as is duplicate data), we work on a StartDateTime only, and the EndDateTime is derived: it is the next StartDateTime. This is the Temporal notion of Duration.


              • 从技术上讲,它是一毫秒(无论数据类型的分辨率如何使用)少。

              • 但是,为了合理起见,我们可以说并报告 EndDateTime ,就像 Next.StartDateTime ,并忽略一毫秒的问题。

              • 代码应始终使用> = This.StartDateTime < Next.StartDateTime


                • 消除了许多可避免的错误

                • 请注意,这些比较运算符将时间长度括在括号中,并且应该贯穿上文以常规方式使用的术语比较完全独立于与业务逻辑有关的类似比较运算符,例如。 Sensor.UpperLimit (即注意它,因为两者通常都位于一个 WHERE 子句中,这很容易以便混淆或混淆)。

                • Technically, it is one millisec (whatever the resolution for the Datatype used) less.
                • However, in order to be reasonable, we can speak of, and report, EndDateTime as simply the Next.StartDateTime, and ignore the one millisecond issue.
                • The code should always use >= This.StartDateTime and < Next.StartDateTime.
                  • That eliminates a slew of avoidable bugs
                  • Note that these comparison operators, which bracket the Temporal Duration, and should be used in a conventional manner throughout as per above, are quite independent of similar comparison operators related to business logic, eg. Sensor.UpperLimit (ie. watch for it, because both are often located in one WHERE clause, and it is easy to mix them up or get confused).

                  需要▶SQL代码◀ ,以及使用的测试数据在第3页上。

                  The ▶SQL code◀ required, along with test data used, is on page 3.

                  子查询(3)派生 Alert.PeakValue

                  现在很简单。在 Alert.ReadingDtm 读数中选择 MAX(Value) >和 Alert.EndDtm Alert 的持续时间。

                  Now it is easy. Select the MAX(Value) from Readings between Alert.ReadingDtm and Alert.EndDtm, the duration of the Alert.

                  ▶SQL代码◀ 在第4页上。

                  The ▶SQL code◀ required is on page 4.

                  标量子查询

                  除了是相关子查询之外,以上均为标量子查询,因为它们返回单个值;网格中的每个单元格只能填充一个值。 (返回多个值的非标量子查询非常合法,但不适用于上述情况。)

                  In addition to being Correlated Subqueries, the above are all Scalar Subqueries, as they return a single value; each cell in the grid can be filled with only one value. (Non-Scalar Subqueries, that return multiple values, are quite legal, but not for the above.)

                  子查询(4)确认的警报

                  好吧,现在您已经掌握了上面的相关标量子查询,这些子查询可以填充集合中的单元格,该集合由外部查询定义,让我们来看一下可以用来约束外部查询的子查询。我们实际上并不需要 all Alerts (上方),我们想要未确认的警报警报中存在的标识符,确认中不存在的标识符。那不是填充单元格,而是改变了外部集的内容。当然,这意味着更改 WHERE 子句。

                  Ok, now that you have a handle on the above Correlated Scalar Subqueries, those that fill cells in a set, a set that is defined by the Outer query, let's look at a Subquery that can be used to constrain the Outer query. We do not really want all Alerts (above), we want Un-Acknowledged Alerts: the Identifiers that exist in Alert, that do not exist in Acknowledgement. That is not filling cells, that is changing the content of the Outer set. Of course, that means changing the WHERE clause.


                  • 我们不会更改外部集的结构,因此 FROM 现有 的位置没有变化子句。

                  • We are not changing the structure of the Outer set, so there is no change to the FROM and existing WHERE clauses.

                  只需添加 WHERE 条件即可排除已确认的警报。 1 :: 1基数,直接相关联。

                  Simply add a WHERE condition to exclude the set of Acknowledged Alerts. 1::1 cardinality, straight Correlated join.

                  ▶SQL代码◀ 在第5页上。

                  The ▶SQL code◀ required is on page 5.

                  区别在于,这是一个非标量子查询,产生一组行(一个列)。我们有一整套的警报(外层)与一整套的致谢相匹配。

                  The difference is, this is a non-Scalar Subquery, producing a set of rows (one column). We have an entire set of Alerts (the Outer set) matched against an entire set of Acknowledgements.


                  • 之所以进行匹配,是因为我们已经通过使用别名告诉引擎子查询与相关相关(

                  • 使用 1 ,因为我们正在执行存在性检查。将其可视化为添加到由外部查询定义的警报集上的列。

                  • 请不要使用*,因为我们不需要整个一组列,并且速度会变慢

                  • 同样,由于未使用相关性,因此需要 WHERE NOT IN() ,但是,这再次构造了定义的列集,然后比较了这两个集。

                  • The matching is processed because we have told the engine that the Subquery is Correlated, by using an alias (no need for cumbersome joins to be identified)
                  • Use 1, because we are performing an existence check. Visualise it as a column added onto the Alert set defined by the Outer query.
                  • Never use * because we do not need the entire set of columns, and that will be slower
                  • Likewise, failing to use a correlation, means a WHERE NOT IN () is required, but again, that constructs the defined column set, then compares the two sets. Much slower.

                  子查询(5) 操作警报

                  作为外部查询的替代约束,对于未操作的警报,而不是(4) ,则排除操作警报的集合。直相关联。

                  As an alternative constraint on the Outer query, for un-actioned Alerts, instead of (4), exclude the set of Actioned Alerts. Straight Correlated join.

                  ▶SQL代码◀ 在第5页上。

                  The ▶SQL code◀ required is on page 5.

                  此代码已经过测试Sybase ASE 15.0.3,使用1000个警报和200个确认(不同组合);以及文档中标识的读数警报。所有执行的执行时间为零毫秒(0.003秒分辨率)。

                  This code has been tested on Sybase ASE 15.0.3 using 1000 Alerts and 200 Acknowledgements, of different combinations; and the Readings and Alerts identified in the document. Zero milliseconds execution time (0.003 second resolution) for all executions.

                  如果需要,这是 ▶文本格式的SQL代码◀

                  If you need it, here is the ▶SQL Code in Text Format◀.

                  (6) ▶从阅读中注册警报◀

                  此代码循环执行(提供) ,选择超出范围的新读数,并创建警报,除非适用警报已经存在。

                  (6) ▶Register Alert from Reading◀
                  This code executes in a loop (provided), selecting new Readings which are out-of-range, and creating Alerts, except where applicable Alerts already exist.

                  (7) ▶从阅读中加载警报◀

                  鉴于您有完整的 Readin的测试数据g ,此代码使用(6)的修改形式来加载适用的警报

                  (7) ▶Load Alert From Reading◀
                  Given that you have a full set of test data for Reading, this code uses a modified form of (6) to load the applicable Alerts.

                  当您知道如何时,它就是简单的。我再说一遍,写SQL而没有写子查询的能力是非常有限的。

                  It is "simple" when you know how. I repeat, writing SQL without the ability to write Subqueries is very limiting; it is essential for handling Relational Databases, which is what SQL was designed for.


                  • 开发人员实现非规范化数据堆(大规模数据重复)的原因有一半)是因为他们无法编写规范化结构


                    • 所需的子查询,这并不是因为他们已针对性能进行了规范化;这是因为它们无法为Normalized编码。我看过一百遍了。

                    • 这里的例子:您有一个完全规范化的关系数据库,困难在于对其进行编码,并且您正在考虑复制表以进行处理。

                    我想您可以找出剩余的查询内容。

                    I think you can figure out the remaining queries you have.

                    注意,此示例还恰好演示了使用关系标识符的强大功能,因为中间有几个表我们想要的那些键不必连接(是的!事实是,关系标识符比 Id 键意味着更少而不是更多的连接)。只需遵循实线即可。

                    Note, this example also happens to demonstrate the power of using Relational Identifiers, in that several tables in-between the ones we want do not have to be joined (yes! the truth is Relational Identifiers means less, not more, joins, than Id keys). Simply follow the solid lines.


                    • 您的时间需求需要包含 DateTime 的键。想象一下,尝试使用 Id PK对上面的代码进行编码,将有两个处理级别:一个用于连接(然后会有更多的处理),另一个用于数据处理。

                    • Your temporal requirement demands keys containing DateTime. Imagine trying to code the above with Id PKs, there would be two levels of processing: one for the joins (and there would be far more of them), and another for the data processing.

                    我尝试远离口语标签(嵌套,内部等),因为它们不是特定的,并遵循特定的技术术语。为了完整性和理解,

                    I try to stay away from colloquial labels ("nested", "inner", etc) because they are not specific, and stick to specific technical terms. For completeness and understanding:


                    • FROM 子句之后的子查询是 Materialised View (一种材料视图),它是在一个查询中派生的结果集,然后作为表送入另一个查询的 FROM 子句中。


                      • Oracle类型称为此内联视图。

                      • 在大多数情况下,您可以将相关子查询编写为物化视图,但这需要大量的I / O和处理(因为Oracle对子查询的处理非常糟糕,仅对于Oracle,物化视图是更快)。


                      • a Subquery after the FROM clause, is a Materialised View, a result set derived in one query and then fed into the FROM clause of another query, as a "table".
                        • The Oracle types call this Inline View.
                        • In most cases, you can write Correlated Subqueries as Materialised Views, but that is massively more I/O and processing (since Oracles handling of subqueries is abyssmal, for Oracle only, Materialised Views are "faster").
                          .

                        WHERE 子句中的子查询是谓词子查询,因为它会更改结果集的内容(基于它的谓词)。它可以返回标量(一个值)或非标量(许多值)。

                        A Subquery in the WHERE clause is a Predicate Subquery, because it changes the content of the result set (that which it is predicated upon). It can return either a Scalar (one value) or non-Scalar (many values).


                        • 对于标量,请使用 WHERE列= 或任何标量运算符

                        对于非标量,请使用 WHERE [NOT]存在 WHERE列[NOT] IN

                        for non-Scalars, use WHERE [NOT] EXISTS, or WHERE column [NOT] IN

                        WHERE 子句中的Suquery不需要相关;以下工作正常。识别所有多余的附属物:

                        A Suquery in the WHERE clause does not need to be Correlated; the following works just fine. Identify all superfluous appendages:

                        SELECT  [Never] = FirstName,
                                [Acted] = LastName 
                            FROM User 
                            WHERE UserId NOT IN ( SELECT DISTINCT UserId
                                FROM Action
                                )

                        这篇关于SQL查找表中首次出现的数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆