计算不匹配并丢失 [英] Count the mismatch and missing

查看:100
本文介绍了计算不匹配并丢失的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面是TestingTable1中的数据

  BUYER_ID | ITEM_ID | CREATED_TIME 
----------- + ------------------- + ------------- -----------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012 -07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07 -09 13:40:23
1345653 261060982989 2012-07-09 09:02:21

以下是TestingTable2中的数据

  USER_ID | PRODUCT_ID | LAST_TIME 
----------- + ------------------- + ------------- ------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19 :31:48
1345653 400307563710 2012-07-09 18:57:33

I需要比较 TestingTable2 TestingTable1 BUYER_ID USER_ID 。我需要从 TestingTable1 进行比较后找到 TestingTable2 中缺失和不匹配的项目(基本上是计数)。我为此创建了SQL小提琴 -



http ://sqlfiddle.com/#!3 / d87b2 / 1



如果您在SQL Fiddle中运行我的查询,您将得到as-

  BUYER_ID错误
1345653 5

$在$ TestingTable1 中缺少 $ b

这是最后一个三个行。 c $ c> TestingTable1 on BUYER_ID USER_ID



现在复杂的事情开始了。

问题陈述 -



在我当前的输出中,作为 5 。因此,如果您在表格中看到第一行 ITEM_ID PRODUCT_ID 相同但 CREATED_TIME 和 LAST_TIME 不相同,并且这两次之间的差异仅为 1分钟。所以目前我报告说这是不匹配,但我需要的是,如果它们之间的差异在 15分钟范围内,那么我不想报告为错误。因此,在我当前的查询中实现此功能后,我将获得错误计数为4,因为差异在第一行的15分钟范围内。

所以我该如何实现这个功能在我目前的查询中?这是我的问题。



P.S-我正与Hive和Hive一起支持sql语法。所以我认为任何修改都可以在我当前的查询中运行。

解决方案

您可以修改SQL Server SQL Fiddle查询,上班。但是,我不确定Hive是否支持datediff

  SELECT TT.BUYER_ID,COUNT(*)FROM 
(SELECT testingtable1.buyer_id,testingtable1.item_id,testingtable1.created_time FROM
testingtable2 RIGHT JOIN testingtable1
ON(testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND abs(datediff(mi,testingtable1.created_time,testingtable2.last_time))<= 15)
where testingtable2.product_id IS NULL)TT GROUP BY TT.BUYER_ID;


Below is the data in TestingTable1

BUYER_ID   |   ITEM_ID         |    CREATED_TIME
-----------+-------------------+------------------------
1345653        110909316904         2012-07-09 21:29:06
1345653        151851771618         2012-07-09 19:57:33
1345653        221065796761         2012-07-09 19:31:48
1345653        400307563710         2012-07-09 18:57:33
1345653        310411560125         2012-07-09 16:09:49
1345653        120945302103         2012-07-09 13:40:23
1345653        261060982989         2012-07-09 09:02:21

Below is the data in TestingTable2

USER_ID   |   PRODUCT_ID           |    LAST_TIME
-----------+-------------------+-------------------
1345653       110909316904         2012-07-09 21:30:06
1345653       152851771618         2012-07-09 19:57:33
1345653       221065796761         2012-07-09 19:31:48
1345653       400307563710         2012-07-09 18:57:33

I need to Compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2 after comparing from TestingTable1. I created SQL fiddle for this-

http://sqlfiddle.com/#!3/d87b2/1

If you run my query in the SQL Fiddle, you will get output as-

BUYER_ID    ERROR
1345653       5

which is right as last three rows from TestingTable1 is missing in TestingTable2 and rest two are mismatch after comparison from TestingTable1 on BUYER_ID and USER_ID.

Now the complicated thing is starting.

Problem Statement-

In my current output, I am getting ERROR count as 5. So if you see first row in both the tables ITEM_ID and PRODUCT_ID are same but CREATED_TIME and LAST_TIME is not same, and difference between those two times is of only 1 minute. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.

So how can I implement this feature in my current query? That's my question.

P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.

解决方案

your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff

SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
 testingtable2 RIGHT JOIN testingtable1 
 ON (testingtable1.item_id = testingtable2.product_id 
 AND testingtable1.BUYER_ID = testingtable2.USER_ID 
 AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
 where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;

这篇关于计算不匹配并丢失的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆