Spark SQL case-insensitive filter for column conditions
Question
How can a Spark SQL filter be used as a case-insensitive filter?
For example:
dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));
only returns rows whose 'vendor' column equals 'fortinet' exactly, but I want the rows whose 'vendor' column equals 'fortinet', 'Fortinet', 'foRtinet', and so on.
Answer
You can either use a case-insensitive regex:
val df = sc.parallelize(Seq(
(1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")
df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
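The same pattern semantics can be checked outside Spark with any regex engine that supports embedded flags. A minimal plain-Python sketch (standard `re`, not Spark) showing that the inline `(?i)` flag plus the `^`/`$` anchors give a case-insensitive exact match, mirroring the `rlike` predicate above:

```python
import re

# (?i) makes the whole pattern case-insensitive; ^ and $ force a full match,
# mirroring rlike("(?i)^fortinet$") above.
pattern = re.compile(r"(?i)^fortinet$")

rows = [(1, "Fortinet"), (2, "foRtinet"), (3, "foo")]
matched = [(k, v) for k, v in rows if pattern.search(v)]
print(matched)  # [(1, 'Fortinet'), (2, 'foRtinet')]
```

Without the anchors, Spark's `rlike` (like `re.search` here) matches the pattern anywhere inside the string, so a value such as 'fortinet-2' would also pass.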
or simple equality with lower / upper:
import org.apache.spark.sql.functions.{lower, upper}
df.where(lower($"v") === "fortinet")
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
df.where(upper($"v") === "FORTINET")
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
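The lower/upper approach normalizes case on the column side before an ordinary equality test. A minimal plain-Python sketch of the same idea (standing in for the Spark column expressions above):

```python
rows = [(1, "Fortinet"), (2, "foRtinet"), (3, "foo")]

# lower($"v") === "fortinet": normalize the column value, then compare.
matched_lower = [(k, v) for k, v in rows if v.lower() == "fortinet"]

# upper($"v") === "FORTINET": the symmetric variant.
matched_upper = [(k, v) for k, v in rows if v.upper() == "FORTINET"]

print(matched_lower)  # [(1, 'Fortinet'), (2, 'foRtinet')]
```

Both variants select the same rows; the only requirement is that the literal on the right-hand side is already in the chosen case.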
For simple filters I would prefer rlike, although performance should be similar; for join conditions, equality is a much better choice. See "How can we JOIN two Spark SQL dataframes using a SQL-esque 'LIKE' criterion?" for details.
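One reason equality is preferable for joins: a case-normalized equality key still supports a hash join, while a LIKE/regex predicate forces pairwise comparison of the two sides. A rough plain-Python sketch of the idea (the dict stands in for the engine's hash table; the sample data and names are illustrative only):

```python
left = [(1, "Fortinet"), (2, "foRtinet")]
right = [("fortinet", "matched"), ("foo", "other")]

# Build a hash table keyed on the case-normalized join column...
index = {}
for key, tag in right:
    index.setdefault(key.lower(), []).append(tag)

# ...then probe it with the normalized key from the other side: O(n + m),
# instead of the O(n * m) scan a LIKE-style join condition would need.
joined = [(k, v, tag) for k, v in left for tag in index.get(v.lower(), [])]
print(joined)  # [(1, 'Fortinet', 'matched'), (2, 'foRtinet', 'matched')]
```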