标记重复记录的 T-SQL 查询 [英] T-SQL query to flag repeat records

查看:33
本文介绍了标记重复记录的 T-SQL 查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含 500,000 多条记录的表.每条记录都有一个 LineNumber 字段,它不是唯一的,也不是主键的一部分.每条记录都有一个 CreatedOn 字段.

I have a table that will have 500,000+ records. Each record has a LineNumber field which is not unique and not part of the primary key. Each record has a CreatedOn field.

我需要更新所有 500,000 多条记录以识别重复记录.

I need to update all 500,000+ records to identify repeat records.

重复记录定义为在其 CreatedOn 字段的最后 7 天内具有相同 LineNumber 的记录.

A repeat records is defined by a record that has the same LineNumber within the last seven days of its CreatedOn field.

在上图中,第 4 行是重复的,因为它自第 1 行起仅发生了五天.尽管第 6 行距第 4 行仅发生了四天,但第 6 行不是重复,但第 4 行本身已经是重复,因此第 6 行只能与第 6 行前 9 天的第 1 行进行比较,因此第 6 行不是重复.

In the diagram above row 4 is a repeat because it occurred only five days since row 1. Row 6 is not a repeat even though it occurs only four days since row 4, but row 4 itself is already a repeat, so Row 6 can only be compared to Row 1 which is nine days prior to Row 6, therefore Row 6 is not a repeat.

我不知道如何通过游标或其他方式逐条浏览每条记录来更新 IsRepeat 字段.

I do not know how to update the IsRepeat field with stepping through each record one-by-one via a cursor or something.

我不相信游标是可行的方法,但我坚持使用任何其他可能的解决方案.

I do not believe cursors is the way to go, but I'm stuck with any other possible solution.

我曾考虑过通用表表达式可能会有所帮助,但我没有使用它们的经验,也不知道从哪里开始.

I have considered maybe Common Table Expressions may be of help but I have no experience with them and have no idea where to start.

基本上每天都需要在表上执行相同的过程,因为表每天都会被截断和重新填充.重新填充表格后,我必须检查并重新标记每条记录是否重复.

Basically this same process needs to be done on the table every day as the table is truncated and re-populated every single day. Once the table is re-populated, I have to go through and re-mark each record if it is a repeat or not.

非常感谢您的帮助.

更新

这是一个创建表格并插入测试数据的脚本

Here is a script to create a table and insert test data

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
DROP TABLE [dbo].[Job]
GO

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[Job](
    [JobID] [int] IDENTITY(1,1) NOT NULL,
    [LineNumber] [nvarchar](20) NULL,
    [IsRepeat] [bit] NULL,
    [CreatedOn] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_Job] PRIMARY KEY CLUSTERED 
(
    [JobID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO


SET NOCOUNT ON

INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13')
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09')
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29')
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33')
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03')
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21')
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26')
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17')
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08')

SET NOCOUNT OFF
GO

推荐答案

忽略 LineNumber 为空.在这种情况下应该如何处理 IsRepeat?

Ignores LineNumber is null. How should IsRepeat be handled in that case?

它适用于测试数据.它的效率是否足以满足生产量的需求?

It works for test data. Whether it will be efficient enough for production volumes?

在pair上有重复(LineNumber, CreatedOn)的情况下,任意选择一个.(JobId 最小的那个)

In the case of duplicate (LineNumber, CreatedOn) on pairs, arbitrarily choose one. (The one with minimum JobId)

基本思路:

  1. 获取所有的 JobId 对至少相隔 7 天,由行号.
  2. 数一数超过 7 天的行从左侧,向上和包括右侧.(CNT)
  3. 然后我们知道如果 JobId x 不是重复的,那么下一个不是重复的是带有 X 的对左侧,CNT = 1
  4. 使用递归 CTE 从每个 LineNumber 的第一行开始
  5. 递归元素使用带有计数的对来获取下一行.
  6. 最后更新,将所有 IsRepeat 设置为 0(非重复)和 1(其他).

<小时>

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
            , LeftJobId
            , RightJobId
            , BeginCreatedOn
            , EndCreatedOn) as
        (select l.LineNumber
            , l.JobId
            , r.JobId
            , dateadd(day, 7, l.CreatedOn)
            , r.CreatedOn
        from Job l
        inner join Job r
            on l.LineNumber = r.LineNumber
            and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
            and l.JobId <> r.JobId)
    -- Count the number of rows within from BeginCreatedOn 
    -- up to and including EndCreatedOn
    -- In the case of CreatedOn = EndCreatedOn, 
    -- include only jobId <= jobid, to handle ties in CreatedOn        
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
        (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
        from AllPairsByLineNumberAtLeast7DaysApart ap
        inner join Job j
            on j.LineNumber = ap.LineNumber
            and ap.BeginCreatedOn <= j.createdOn
            and (j.CreatedOn < ap.EndCreatedOn
                or (j.CreatedOn = ap.EndCreatedOn 
                    and j.JobId <= ap.RightJobId))
         group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
        (select LineNumber, JobId, CreatedOn
            , row_number() over 
                (partition by LineNumber order by CreatedOn, JobId)
        from Job)
    , Results (JobId, LineNumber, CreatedOn) as    
        -- Start with the first rows.
        (select JobId, LineNumber, CreatedOn
        from Step1
        where RN = 1
        and LineNumber is not null
        -- get the next row
        union all
        select j.JobId, j.LineNumber, j.CreatedOn
        from Results r
        inner join AllPairsCount apc on apc.LeftJobId = r.JobId
        inner join Job j
            on j.JobId = apc.RightJobId
            and apc.CNT = 1)
    update j
    set IsRepeat = case when R.JobId is not null then 0 else 1 end
    from Job j
    left outer join Results r
        on j.JobId = R.JobId
    where j.LineNumber is not null

昨晚我关掉电脑后,我意识到我让事情变得比他们需要的更复杂.一个更直接(在测试数据上,稍微更有效)的查询:

After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and on the test data, slightly more effecient) query:

基本思路:

  1. Generated PotentialStep (FromJobId, ToJobId) 这些是如果 FromJobId不是重复,比 ToJobId 也不是重复.(第一行按 LineNumber 更多距 FromJobId 超过 7 天)
  2. 使用递归 CTE 从每个 LineNumber 的第一个 JobId 开始,然后步进,使用 PontentialSteps,到每个非重复 JobId

<小时>

; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
    from (select f.JobId as FromJobId
            , t.JobId as ToJobId
            , row_number() over
                 (partition by f.LineNumber order by t.CreatedOn, t.JobId) as RN
        from Job f
        inner join Job t
            on f.LineNumber = t.LineNumber
            and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
        where RN = 1)
, NonRepeats (JobId) as
    (select JobId
    from (select JobId
            , row_number() over
                (partition by LineNumber order by CreatedOn, JobId) as RN
        from Job) Start
    where RN = 1
    union all
    select J.JobId
    from NonRepeats NR
    inner join PotentialSteps PS
        on NR.JobId = PS.FromJobId
    inner join Job J
        on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null

这篇关于标记重复记录的 T-SQL 查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆