Custom join with non-equal keys

Question

I need to implement a custom join strategy that matches keys which are not strictly equal. To illustrate, think of a distance: the join should occur when the keys are close enough (although in my case it's a bit more complicated than just a distance metric).

I can't implement this by overriding equals, since the match is not an equality (and I need to keep a true equality test for other purposes). I suppose I also need to implement a proper partitioner.

How can I do this?

Recommended answer

Convert the RDDs to DataFrames. You can then join on an arbitrary condition, for example a range check:

val newDF = leftDF.join(rightDF, $"col1" < ceilingVal and $"col1" > floorVal)

You can also define UDFs and use them in the join condition. So if you had a "distanceUDF" like this:

// Absolute difference, so the sign of val2 - val1 does not matter
val distanceUDF = udf[Int, Int, Int]((val1, val2) => math.abs(val2 - val1))

you could join like this:

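// Alias both sides so that the "left." and "right." column prefixes resolve
val newDF = leftDF.as("left").join(rightDF.as("right"), distanceUDF($"left.colX", $"right.colY") < 10)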
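
To see the pieces working together, here is a minimal, self-contained sketch of the UDF-based join. The sample data, the local SparkSession settings, the object name, and the helper `distance` are assumptions made for illustration; only `colX`, `colY`, `distanceUDF`, and the threshold of 10 come from the answer. The sketch takes the absolute difference so that the value behaves like a distance.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NonEquiJoinSketch {
  // Hypothetical "closeness" measure: absolute difference of two Int keys.
  def distance(v1: Int, v2: Int): Int = math.abs(v2 - v1)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would configure this differently.
    val spark = SparkSession.builder()
      .appName("non-equi-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; colX and colY play the role of the join keys.
    val leftDF  = Seq((1, "a"), (20, "b"), (50, "c")).toDF("colX", "leftVal")
    val rightDF = Seq((5, "x"), (25, "y"), (100, "z")).toDF("colY", "rightVal")

    // Wrap the plain Scala function as a UDF usable in column expressions.
    val distanceUDF = udf((v1: Int, v2: Int) => distance(v1, v2))

    // The aliases make the "left." and "right." prefixes resolvable.
    val joined = leftDF.as("left")
      .join(rightDF.as("right"), distanceUDF($"left.colX", $"right.colY") < 10)

    joined.show()
    spark.stop()
  }
}
```

With this sample data, two pairs fall within the threshold: (1, 5) and (20, 25). Note that a condition like this is not an equality, so Spark generally cannot hash-partition on it and is likely to compare many more row pairs than an equi-join would; that cost is worth keeping in mind for large inputs.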
