How to join 2 dataframes in Spark based on a wildcard/regex condition?
Question
I have 2 dataframes, df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, e.g.:
- stackoverflow.com/questions/*
- *.cnn.com
- cnn.com/*/politics
The second dataframe df2 has a url field which contains only valid URLs, without wildcards.
I need to join these two dataframes, something like df1.join(df2, $"location" matches $"url"), if there were a magic matches operator usable in join conditions.
After some googling I still don't see a way to achieve this. How would you approach solving such a problem?
Answer
There is a "magic" matches operator: it is called rlike. Note that the Column.rlike method takes a literal string pattern rather than another column, which is why the column-to-column condition below goes through expr:
import org.apache.spark.sql.functions.expr
import spark.implicits._  // spark is the active SparkSession

val df1 = Seq(
  "stackoverflow.com/questions/.*$",
  "^.*\\.cnn\\.com$",
  "^cnn\\.com/.*/politics$"
).toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
| url| location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
But there are some caveats:
- Patterns have to be proper regular expressions, anchored in the case of leading/trailing wildcards.
- The join executes as a Cartesian product (BroadcastNestedLoopJoin), as the physical plan shows:
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
: +- *Filter isnotnull(value#215)
: +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
+- *Project [value#209 AS location#211]
+- *Filter isnotnull(value#209)
+- LocalTableScan [value#209]
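The first caveat means that glob-style patterns such as *.cnn.com cannot be fed to rlike as-is; they first have to be rewritten as anchored regular expressions. A minimal sketch of such a conversion (the helper name wildcardToRegex is my own, not part of any Spark API):

```scala
// Hypothetical helper: turn a glob-style URL pattern into an anchored
// Java regex suitable for rlike. Regex metacharacters are escaped and
// each "*" wildcard becomes ".*".
def wildcardToRegex(pattern: String): String = {
  val body = pattern.flatMap {
    case '*'                                => ".*"
    case c if "\\.[]{}()+-^$|?".contains(c) => "\\" + c
    case c                                  => c.toString
  }
  "^" + body + "$" // anchor, so leading/trailing wildcards behave as expected
}

wildcardToRegex("*.cnn.com")          // "^.*\.cnn\.com$"
wildcardToRegex("cnn.com/*/politics") // "^cnn\.com/.*/politics$"
```

In Spark this would typically be applied once to the location column of df1 (e.g. via a UDF, or while building the pattern list) before performing the rlike join.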
It is possible to filter candidates first, using the approach described in Efficient string matching in Apache Spark.
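One concrete way to apply that idea (this prefilter is my own illustration of the approach, not the exact method from the linked answer): derive a coarse equi-join key from both sides, such as the last two host labels, join on that key first, and run the expensive regex check only on the surviving candidate pairs. It is sketched here on plain collections; in Spark the key would be computed with a UDF and added as an equi-join condition before the rlike filter. It assumes patterns never wildcard the registrable domain itself (a pattern like cnn.* would defeat it):

```scala
// Hypothetical prefilter key: keep only the last two host labels, ignoring
// "*" wildcards and the path, so "*.cnn.com" and "edition.cnn.com" both
// map to "cnn.com".
def coarseKey(s: String): String =
  s.takeWhile(_ != '/')                  // drop the path
   .split('.')
   .filter(l => l.nonEmpty && l != "*")  // drop empty labels and wildcards
   .takeRight(2)
   .mkString(".")

// Candidate generation: only pairs that agree on the coarse key survive,
// so the regex match runs on far fewer rows than a full Cartesian product.
val patterns = Seq("*.cnn.com", "cnn.com/*/politics", "stackoverflow.com/questions/*")
val urls     = Seq("edition.cnn.com", "stackoverflow.com/questions/47272330")
val candidates = for {
  p <- patterns
  u <- urls
  if coarseKey(p) == coarseKey(u)
} yield (p, u)
// 3 candidate pairs instead of the 6 a full cross product would produce
```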