How to join 2 dataframes in Spark based on a wildcard/regex condition?
Question
I have 2 dataframes, df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, e.g.:
- stackoverflow.com/questions/*
- *.cnn.com
- cnn.com/*/politics
The second dataframe df2 has a url field which contains only valid URLs, without wildcards.
I need to join these two dataframes, something like df1.join(df2, $"location" matches $"url"), if there were a magic matches operator usable in join conditions.
After some googling I still don't see a way to achieve this. How would you approach solving such a problem?
Answer
There is a "magic" matches operator: it is called rlike. Note that the Column.rlike method takes a literal string pattern rather than another column, which is why the column-to-column condition below goes through expr:
import org.apache.spark.sql.functions.expr
import spark.implicits._  // spark is the active SparkSession

val df1 = Seq(
  "stackoverflow.com/questions/.*$",
  "^.*\\.cnn\\.com$",
  "^cnn\\.com/.*/politics$"
).toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
| url| location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
But there are some caveats:
- Patterns have to be proper regular expressions, anchored in the case of leading/trailing wildcards.
- The join executes as a Cartesian product (BroadcastNestedLoopJoin), as the physical plan shows:
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
: +- *Filter isnotnull(value#215)
: +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
+- *Project [value#209 AS location#211]
+- *Filter isnotnull(value#209)
+- LocalTableScan [value#209]
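The first caveat means that glob-style patterns such as *.cnn.com cannot be fed to rlike as-is; they first have to be rewritten as anchored regular expressions. A minimal sketch of such a conversion (the helper name wildcardToRegex is my own, not part of any Spark API):

```scala
// Hypothetical helper: turn a glob-style URL pattern into an anchored
// Java regex suitable for rlike. Regex metacharacters are escaped and
// each "*" wildcard becomes ".*".
def wildcardToRegex(pattern: String): String = {
  val body = pattern.flatMap {
    case '*'                                => ".*"
    case c if "\\.[]{}()+-^$|?".contains(c) => "\\" + c
    case c                                  => c.toString
  }
  "^" + body + "$" // anchor, so leading/trailing wildcards behave as expected
}

wildcardToRegex("*.cnn.com")          // "^.*\.cnn\.com$"
wildcardToRegex("cnn.com/*/politics") // "^cnn\.com/.*/politics$"
```

In Spark this would typically be applied once to the location column of df1 (e.g. via a UDF, or while building the pattern list) before performing the rlike join.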
It is possible to filter candidates first, using the approach described in Efficient string matching in Apache Spark.
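One concrete way to apply that idea (this prefilter is my own illustration of the approach, not the exact method from the linked answer): derive a coarse equi-join key from both sides, such as the last two host labels, join on that key first, and run the expensive regex check only on the surviving candidate pairs. It is sketched here on plain collections; in Spark the key would be computed with a UDF and added as an equi-join condition before the rlike filter. It assumes patterns never wildcard the registrable domain itself (a pattern like cnn.* would defeat it):

```scala
// Hypothetical prefilter key: keep only the last two host labels, ignoring
// "*" wildcards and the path, so "*.cnn.com" and "edition.cnn.com" both
// map to "cnn.com".
def coarseKey(s: String): String =
  s.takeWhile(_ != '/')                  // drop the path
   .split('.')
   .filter(l => l.nonEmpty && l != "*")  // drop empty labels and wildcards
   .takeRight(2)
   .mkString(".")

// Candidate generation: only pairs that agree on the coarse key survive,
// so the regex match runs on far fewer rows than a full Cartesian product.
val patterns = Seq("*.cnn.com", "cnn.com/*/politics", "stackoverflow.com/questions/*")
val urls     = Seq("edition.cnn.com", "stackoverflow.com/questions/47272330")
val candidates = for {
  p <- patterns
  u <- urls
  if coarseKey(p) == coarseKey(u)
} yield (p, u)
// 3 candidate pairs instead of the 6 a full cross product would produce
```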