How to join 2 dataframes in Spark based on a wildcard/regex condition?


Question

I have 2 dataframes df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, e.g.:

  • stackoverflow.com/questions/*
  • *.cnn.com
  • cnn.com/*/politics

The second dataframe df2 has a url field which may contain only valid URLs without wildcards.

I need to join these two dataframes, something like df1.join(df2, $"location" matches $"url"), if there were a magic matches operator usable in join conditions.

After some googling I still don't see a way to achieve this. How would you approach solving such a problem?

Answer

There is a "magic" matches operator - it is called rlike:

val df1 = Seq("stackoverflow.com/questions/.*$","^*.cnn.com$", "nn.com/*/politics").toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
|                 url|            location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
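Note that rlike matches Java regular expressions, not glob-style wildcards, so location patterns such as *.cnn.com would need converting before the join. A minimal sketch of such a conversion (the wildcardToRegex helper is a hypothetical illustration, not part of the Spark API):

```scala
import java.util.regex.Pattern

// Turn a glob-style URL pattern into an anchored Java regex:
// literal parts are escaped (so '.', '/', '?' lose their regex meaning)
// and each '*' becomes '.*'.
def wildcardToRegex(pattern: String): String =
  "^" + pattern.split("\\*", -1)   // -1 keeps trailing empty segments
    .map(Pattern.quote)            // escape the literal fragments
    .mkString(".*") + "$"          // each '*' matches any run of characters

"edition.cnn.com".matches(wildcardToRegex("*.cnn.com"))                 // true
"cnn.com/2017/politics".matches(wildcardToRegex("cnn.com/*/politics"))  // true
```

The converted pattern could be stored in the location column (e.g. via a UDF or map over df1) before applying the rlike join above.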

But there are some caveats - a regex condition cannot be used as an equi-join key, so Spark falls back to a broadcast nested loop join:

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
:  +- *Filter isnotnull(value#215)
:     +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
   +- *Project [value#209 AS location#211]
      +- *Filter isnotnull(value#209)
         +- LocalTableScan [value#209]

It is possible to filter the candidates first, using the method described in Efficient string matching in Apache Spark.
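A simpler pre-filter than full string-matching machinery, assuming every pattern keeps a literal domain suffix, is to derive a coarse equi-join key on both sides and apply rlike only within groups sharing that key. The coarseKey helper below is a hypothetical sketch, not Spark API:

```scala
// Extract a coarse grouping key - the last two literal host labels -
// from either a pattern ("*.cnn.com") or a concrete URL ("edition.cnn.com/world").
def coarseKey(s: String): String =
  s.split("/", 2)(0)                                          // keep only the host part
    .split("\\.")
    .filterNot(label => label.isEmpty || label.contains("*")) // drop wildcard labels
    .takeRight(2)
    .mkString(".")

coarseKey("*.cnn.com")                             // "cnn.com"
coarseKey("stackoverflow.com/questions/47272330")  // "stackoverflow.com"
```

Matching pairs share the same key, so an equi-join on the key followed by the rlike filter avoids comparing every URL against every pattern; this only works if wildcards never appear inside the last two host labels.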
