Hive:解决非等左连接 [英] Hive: work around for non equi left join
问题描述
Hive 不支持非对等连接:常见的解决方法是将连接条件移动到 where 子句中,当您需要内部连接时,这可以正常工作.但是左连接呢?
Hive does not support non equi joins: The common work around is to move the join condition to the where clause, which work fine when you want an inner join. but what about a left join?
人为的例子.假设我们有一个 orderLineItem 表,我们需要加入一个 ProductPrice 表,该表具有 productID、price &价格适用的日期范围.我们想加入其中 ProductID=ProductID &开始日期和结束日期之间的 OrderDate.如果 productID 或有效日期范围不匹配,我仍然希望查看所有 orderLineItem.
Contrived example. Let say we have an orderLineItem table, and we need to join to a ProductPrice table that has a productID, price & a date range for which the price applies. We want to join to this where ProductID=ProductID & OrderDate between start and End date. If a productID or a valid date range do not match, I'd still want to see all orderLineItems.
这个 SQL 小提琴是我们如何在 MSSQL 中执行此操作的示例:http://sqlfiddle.com/#!6/fb877/7
This SQL fiddle is an example of how we'd do this in MSSQL: http://sqlfiddle.com/#!6/fb877/7
问题如果我应用典型的解决方法,并将 non equi 过滤器移动到 where 子句,它就变成了内连接.在上面的例子中,在 sql fiddle &下面,我有一个不在查找中的产品 ID.
Problem If I apply the typical workaround, and move the non equi filter to the where clause, it becomes an inner join. In the case above, in the sql fiddle & below, I have a product ID that is not in the lookup.
问题:如果 hive 不支持非 eqi-joins,如何实现左非 eqi?
Question: Provided hive does not support non eqi-joins, How can a left non-eqi be achieved ?
[SQLFiddle 内容]
[SQLFiddle Content]
表格:
CREATE TABLE OrderLineItem(
LineItemIDId int IDENTITY(1,1),
OrderID int NOT NULL,
ProductID int NOT NULL,
OrderDate Date
);
CREATE TABLE ProductPrice(
ProductID int,
Cost float,
startDate Date,
EndDate Date
);
加载数据 &我们将如何加入 MSSQL:
loading The data & how we'd join in MSSQL:
--Old Price. Should be ignored
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES (1, 50,'12/1/2012','1/1/2013');
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES (2, 55,'12/1/2012','1/1/2013');
--Price for Order 2. Should be applied to Order 1
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 20,'12/1/2013','1/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 25,'12/1/2013','1/1/2014');
--Price for Order 2. Should be applied to Order 2
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 15,'1/2/2014','3/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 20,'1/2/2014','3/1/2014');
--January 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 1,'1/1/2014') ;
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 2,'1/1/2014');
--Feb 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (2, 1,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 2,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 3,'2/1/2014'); -- no price
SELECT * FROM OrderLineItem;
SELECT * FROM OrderLineItem li LEFT OUTER JOIN ProductPrice p on
p.ProductID=li.ProductID AND OrderDate BETWEEN startDate AND EndDate;
推荐答案
创建左表的副本并添加序列行号:
Create a copy of the left table with added serial row numbers:
CREATE TABLE OrderLineItem_serial AS
SELECT ROW_NUMBER() OVER() AS serial, * FROM OrderLineItem;
备注:这对于某些表格格式可能更有效(必须没有压缩):
Remark: This may work better for some tables formats (must be WITHOUT COMPRESSION):
CONCAT(INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE) AS serial
进行内连接:
Do an inner join:
CREATE TABLE OrderLineItem_inner AS
SELECT * FROM OrderLineItem_serial li JOIN ProductPrice p
on p.ProductID = li.ProductID WHERE OrderDate BETWEEN startDate AND EndDate;
按序列左连接:
Left join by serial:
SELECT * FROM OrderLineItem_serial li
LEFT OUTER JOIN OrderLineItem_inner i on li.serial = i.serial;
这篇关于Hive:解决非等左连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!