Find total time worked across multiple jobs/orders, with overlapping times for each worker and job/order

Question

When I was first starting out with SQL, I searched night and day for an answer to this question. I could not find anything close to what I needed, so I decided to ask and answer my own question in case others need the same help.

Here is an example of the data I have. For simplicity, it all comes from the Job table. Each JobID has its own start and end time. These are basically random: they can overlap, have gaps, start and end at the same time as other jobs, and so on.

--Available--
JobID  WorkerID  JobStart             JobEnd
1      25        '2012-11-17 16:00'  '2012-11-17 17:00'
2      25        '2012-11-18 16:00'  '2012-11-18 16:50'
3      25        '2012-11-19 18:00'  '2012-11-19 18:30'
4      25        '2012-11-19 17:30'  '2012-11-19 18:10'
5      26        '2012-11-18 16:00'  '2012-11-18 17:10'
6      26        '2012-11-19 16:00'  '2012-11-19 16:50'
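
For anyone who wants to reproduce this, here is a minimal setup sketch. The column types and the primary key are assumptions, since only the column names and values are shown above. The multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Assumed schema: the question shows column names and values, not types.
CREATE TABLE Job (
    JobID    INT      NOT NULL PRIMARY KEY,
    WorkerID INT      NOT NULL,
    JobStart DATETIME NOT NULL,
    JobEnd   DATETIME NOT NULL
);

-- The six sample rows exactly as listed above.
INSERT INTO Job (JobID, WorkerID, JobStart, JobEnd)
VALUES
    (1, 25, '2012-11-17 16:00', '2012-11-17 17:00'),
    (2, 25, '2012-11-18 16:00', '2012-11-18 16:50'),
    (3, 25, '2012-11-19 18:00', '2012-11-19 18:30'),
    (4, 25, '2012-11-19 17:30', '2012-11-19 18:10'),
    (5, 26, '2012-11-18 16:00', '2012-11-18 17:10'),
    (6, 26, '2012-11-19 16:00', '2012-11-19 16:50');
```

Jobs 3 and 4 are the only overlapping pair: together they cover 17:30 to 18:30, which is 60 minutes rather than 30 + 40 = 70.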

The result I would like from the query is:

WorkerID  TotalTime(in Mins)
25        170
26        120

I forgot to mention that overlaps need to be ignored. Basically, this is supposed to treat these workers and their jobs the way you would treat an hourly employee, not a contractor. For example, if I worked two JobIDs and started and finished both between 12:00 pm and 12:30 pm, as an employee I would only get paid for 30 minutes, whereas a contractor would likely get paid for 60 minutes, since their jobs are treated individually and paid per job. The point of this query is to analyze jobs in a database that are tied to a worker and find out, if that worker were treated as an employee, what their total time worked over a given period would come out to be.

The site won't let me answer my own question for 7 hours, so I will move the answer there later.

OK, answering the question now. Basically, I use a temp table that holds one row for every minute between the min and max datetime of the jobs I am looking up.

IF OBJECT_ID('tempdb..#time') IS NOT NULL
BEGIN
drop table #time
END
DECLARE @FromDate AS DATETIME,
     @ToDate AS DATETIME,
     @Current AS DATETIME
SET @FromDate = '2012-11-17 16:00'
SET @ToDate = '2012-11-19 18:30'

create table #time  (cte_start_date datetime)
set @current = @FromDate
while (@current < @ToDate)
begin

insert into #time (cte_start_date)
values (@current)

set @current = DATEADD(n, 1, @current)

end
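
The WHILE loop above inserts one row per iteration. As a sketch of a set-based alternative (not part of the original answer), a recursive CTE can fill the same table in a single statement. The CTE name Minutes and the PRIMARY KEY are my own additions. The inline DECLARE initializers assume SQL Server 2008 or later, and the sketch assumes @FromDate is earlier than @ToDate.

```sql
DECLARE @FromDate DATETIME = '2012-11-17 16:00',
        @ToDate   DATETIME = '2012-11-19 18:30';

IF OBJECT_ID('tempdb..#time') IS NOT NULL
    DROP TABLE #time;

CREATE TABLE #time (cte_start_date DATETIME NOT NULL PRIMARY KEY);

-- Anchor row is @FromDate; each recursive step adds one minute and
-- stops before @ToDate, matching the loop's "@current < @ToDate" test.
WITH Minutes AS (
    SELECT @FromDate AS cte_start_date
    UNION ALL
    SELECT DATEADD(minute, 1, cte_start_date)
    FROM Minutes
    WHERE DATEADD(minute, 1, cte_start_date) < @ToDate
)
INSERT INTO #time (cte_start_date)
SELECT cte_start_date
FROM Minutes
OPTION (MAXRECURSION 0);  -- the default limit of 100 levels is far too low here
```

Without the MAXRECURSION hint the statement would fail, because this date range needs a few thousand recursion levels.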

Now I have every minute in a temp table. Next I join the Job table to it and select what I need in one go.

SELECT J.WorkerID
,COUNT(DISTINCT t.cte_start_date) AS TotalTime
FROM #time AS t
INNER JOIN Job AS J ON t.cte_start_date >= J.JobStart AND t.cte_start_date < J.JobEnd --Thanks ErikE
GROUP BY J.WorkerID --Thanks Martin Parkin

drop table #time

That is a very simplified answer, but it is good enough to get someone started.

Recommended answer

This query does the job as well. Its performance is very good: the execution plan does not look great, but the actual CPU and IO beat many other queries.

See it working in SQL Fiddle.

WITH Times AS (
   SELECT DISTINCT
      H.WorkerID,
      T.Boundary
   FROM
      dbo.JobHistory H
      CROSS APPLY (VALUES (H.JobStart), (H.JobEnd)) T (Boundary)
), Groups AS (
   SELECT
      WorkerID,
      T.Boundary,
      Grp = Row_Number() OVER (PARTITION BY T.WorkerID ORDER BY T.Boundary) / 2
   FROM
      Times T
      CROSS JOIN (VALUES (1), (1)) X (Dup)
), Boundaries AS (
   SELECT
      G.WorkerID,
      TimeStart = Min(Boundary),
      TimeEnd = Max(Boundary)
   FROM
      Groups G
   GROUP BY
      G.WorkerID,
      G.Grp
   HAVING
      Count(*) = 2
)
SELECT
   B.WorkerID,
   WorkedMinutes = Sum(DateDiff(minute, 0, B.TimeEnd - B.TimeStart))
FROM
   Boundaries B
WHERE
   EXISTS (
      SELECT *
      FROM dbo.JobHistory H
      WHERE
         B.WorkerID = H.WorkerID
         AND B.TimeStart < H.JobEnd
         AND B.TimeEnd > H.JobStart
   )
GROUP BY
   WorkerID
;

Here are the performance results, measured with a clustered index on (WorkerID, JobStart, JobEnd, JobID), using the 7 sample rows from the fiddle above as a template for new worker/job data, repeated enough times to yield a table with 14,336 rows. I've included the other working/correct answers on the page (so far):

Author  CPU  Elapsed  Reads   Scans
------  ---  -------  ------  -----
  Erik  157    166      122       2
Gordon  375    378    106964  53251

I ran a more exhaustive test on a different (slower) server. Each query was run 25 times, the best and worst values for each metric were thrown out, and the remaining 23 values were averaged. The results:

Query     CPU   Duration  Reads   Notes
--------  ----  --------  ------  ----------------------------------
Erik 1    215   231       122     query as above
Erik 2    326   379       116     alternate technique with no EXISTS
Gordon 1  578   682       106847  from j
Gordon 2  584   673       106847  from dbo.JobHistory

I was sure the alternate technique would improve things. Well, it saved 6 reads but cost a lot more CPU (which makes sense). Rather than carrying the start/end statistics of each time slice through to the end, it is better to simply recalculate which slices to keep with the EXISTS against the original data. A different data profile, such as a few workers with many jobs each, might change the relative performance of the queries.

In case anyone wants to try it, use the CREATE TABLE and INSERT statements from my fiddle and then run this 11 times. Each run doubles the table, so the 7 sample rows become 7 × 2^11 = 14,336 rows:

INSERT dbo.JobHistory
SELECT
   H.JobID + A.MaxJobID,
   H.WorkerID + A.WorkerCount,
   DateAdd(minute, Elapsed + 45, JobStart),
   DateAdd(minute, Elapsed + 45, JobEnd)
FROM
   dbo.JobHistory H
   CROSS JOIN (
      SELECT
         MaxJobID = Max(JobID),
         WorkerCount = Max(WorkerID) - Min(WorkerID) + 1,
         Elapsed = DateDiff(minute, Min(JobStart), Min(JobEnd))
      FROM dbo.JobHistory
   ) A
;

I built two other solutions to this problem. The better one had about double the performance but a fatal flaw: it did not correctly handle fully enclosed time ranges. The other had very high (bad) statistics, which I expected but had to try anyway.

Explanation

Using all the endpoint times from each row, build a distinct list of all time ranges of possible interest by duplicating each endpoint time and then grouping so that each time is paired with the next one. Then sum the elapsed minutes of those ranges wherever they coincide with any of that worker's actual working time.
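
To make the slicing visible, here is a small sketch that lists each slice for worker 25 from the question's sample data and flags whether any job covers it. It is not the answer's query: it pairs each boundary with the next one using LEAD (SQL Server 2012 or later) instead of the duplicate-and-group trick, and the names @Jobs, Slices, Minutes and IsWorked are made up for illustration.

```sql
-- Worker 25's rows from the question (jobs 3 and 4 overlap).
DECLARE @Jobs TABLE (WorkerID INT, JobStart DATETIME, JobEnd DATETIME);

INSERT INTO @Jobs (WorkerID, JobStart, JobEnd)
VALUES
    (25, '2012-11-17 16:00', '2012-11-17 17:00'),
    (25, '2012-11-18 16:00', '2012-11-18 16:50'),
    (25, '2012-11-19 18:00', '2012-11-19 18:30'),
    (25, '2012-11-19 17:30', '2012-11-19 18:10');

WITH Times AS (
    -- Every distinct start or end time per worker.
    SELECT DISTINCT J.WorkerID, T.Boundary
    FROM @Jobs J
    CROSS APPLY (VALUES (J.JobStart), (J.JobEnd)) T (Boundary)
), Slices AS (
    -- Pair each boundary with the next boundary for the same worker.
    SELECT
        WorkerID,
        TimeStart = Boundary,
        TimeEnd = LEAD(Boundary) OVER (PARTITION BY WorkerID ORDER BY Boundary)
    FROM Times
)
SELECT
    S.WorkerID,
    S.TimeStart,
    S.TimeEnd,
    Minutes = DATEDIFF(minute, S.TimeStart, S.TimeEnd),
    -- A slice counts as worked if at least one job overlaps it.
    IsWorked = CASE WHEN EXISTS (
        SELECT *
        FROM @Jobs J
        WHERE J.WorkerID = S.WorkerID
          AND S.TimeStart < J.JobEnd
          AND S.TimeEnd > J.JobStart
    ) THEN 1 ELSE 0 END
FROM Slices S
WHERE S.TimeEnd IS NOT NULL  -- the last boundary has no successor
ORDER BY S.WorkerID, S.TimeStart;
```

For worker 25 the slices flagged as worked are 60, 50, 30, 10 and 20 minutes long, which sum to the 170 minutes expected in the question. The two unflagged slices are the gaps between working days.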
