Custom join with non-equal keys
Question
I need to implement a custom join strategy that matches keys that are not strictly equal. To illustrate, think of a distance: the join should occur when the keys are close enough (although in my case it's a bit more complicated than just a distance metric).
So I can't implement this by overriding equals, since there is no equality (and I need to keep a true equality test for other needs). I suppose I also need to implement a proper partitioner.
How should I do this?
Recommended answer
Convert the RDDs to DataFrames; you can then join on an arbitrary boolean condition, like this:
val newDF = leftDF.join(rightDF, $"col1" < ceilingVal and $"col1" > floorVal)
You can also define UDFs and use them in the join condition. So if you had a "distanceUDF" like this:
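import org.apache.spark.sql.functions.udf
val distanceUDF = udf[Int, Int, Int]((val1, val2) => math.abs(val2 - val1)) // absolute difference, so argument order does not matter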
you could do this:
val newDF = leftDF.as("left").join(rightDF.as("right"), distanceUDF($"left.colX", $"right.colY") < 10)
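For reference, here is a minimal end-to-end sketch of the same idea, assuming Spark 2.0 or later (for SparkSession) on the classpath and a local master. The object name, sample data, and column names (leftId, x, rightId, y) are hypothetical and chosen only for illustration. It uses the built-in abs function instead of a UDF, which is an option when the condition can be written with column expressions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.abs

object NonEquiJoinSketch {
  // Builds two tiny DataFrames from RDDs and joins them on |x - y| < maxDistance.
  // Returns the matching (leftId, rightId) pairs, sorted for stable output.
  def closePairs(spark: SparkSession, maxDistance: Int): Seq[(String, String)] = {
    import spark.implicits._

    // Hypothetical sample data: (id, numeric key).
    val leftDF = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 50)))
      .toDF("leftId", "x")
    val rightDF = spark.sparkContext
      .parallelize(Seq(("p", 5), ("q", 45), ("r", 200)))
      .toDF("rightId", "y")

    // Any boolean Column can serve as the join condition; it need not be an equality.
    val joined = leftDF.join(rightDF, abs($"x" - $"y") < maxDistance)

    joined
      .select($"leftId", $"rightId")
      .as[(String, String)]
      .collect()
      .toSeq
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("non-equi-join-sketch")
      .getOrCreate()
    try closePairs(spark, 10).foreach(println)
    finally spark.stop()
  }
}
```

With the sample data above and a maximum distance of 10, the pairs (a,p) and (b,q) match. Note that a join without any equality condition cannot use a hash or sort-merge strategy, so Spark falls back to a nested-loop or Cartesian plan; this may be expensive on large inputs, and adding a coarse equality condition (for example on a bucketed key) alongside the distance test can help when the problem allows it.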