Spark Scala - splitting string syntax issue

Problem description

I'm trying to split a String in a DataFrame column using Spark SQL and Scala, and there seems to be a difference in the way the split condition works between the two.

Using Scala, this works ->

val seq = Seq("12.1")
val df = seq.toDF("val")

val afterSplit = df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne"))
afterSplit.show(false)
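For reference, the output of afterSplit.show(false) would presumably be (values left-aligned since truncation is disabled):

+-------+
|PartOne|
+-------+
|12     |
+-------+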

However, in Spark SQL, when I use this, firstPartSQL shows a blank.

df.registerTempTable("temp")
val s1 = sqlContext.sql("select split(val, '\\.')[0] as firstPartSQL from temp")
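The result presumably comes back as an empty string (the same behavior is reproduced in the answer below):

+------------+
|firstPartSQL|
+------------+
|            |
+------------+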

Instead, when I use this (the separator expressed as [.] instead of \.), the expected value shows up.

val s1 = sqlContext.sql("select split(val, '[.]')[0] as firstPartSQL from temp")
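With the [.] character class, the output should look like:

+------------+
|firstPartSQL|
+------------+
|          12|
+------------+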

Any ideas why this is happening?

Recommended answer

When you use regex patterns in spark-sql with double quotes, spark.sql("....."), the pattern is treated as a string within another string, so two things happen. Consider this:

scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]

scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
|     12|
+-------+


scala> df.createOrReplaceTempView("temp")

With the DataFrame API, the regex string is passed directly to split(), so you only need the single level of escaping (\\.).

But when it comes to spark-sql, the pattern goes through string parsing twice: once by Scala and once more by the SQL parser before it reaches the split() function. So the SQL text needs to contain \\. before spark-sql processes it.

The way to get that is to add two more backslashes:

scala> "\\."
res12: String = \.

scala> "\\\\."
res13: String = \\.

scala>

If you just pass "\\." in spark-sql, it is first converted into \. and then into ".", which in a regex context means "any character", i.e. split on every character. Since each character is adjacent to the next, you get an array of empty strings. The string "12.1" has length four, and the pattern also matches the final boundary "$" of the string, so up to split(val, '\.')[4] you get empty strings; when you ask for split(val, '\.')[5], you get null.
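A quick sketch to check this against the same temp view (assuming pre-ANSI Spark SQL, where an out-of-range array index returns null instead of throwing):

scala> spark.sql("select split(val, '\\.')[4] as idx4, split(val, '\\.')[5] as idx5 from temp").show
+----+----+
|idx4|idx5|
+----+----+
|    |null|
+----+----+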

To verify this, you can pass the same delimiter string "\\." to the regexp_replace() function and see what happens:

scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|            |  9999|
+------------+------+

scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala>

If you still want to use the same pattern between df and sql, then go with a raw string, i.e. triple quotes:

scala> raw"\\."
res23: String = \\.

scala>

scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala>
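On the DataFrame side, a raw string gives the same single-escape pattern without doubling the backslash in the source; a minimal sketch with the same data:

scala> df.withColumn("FirstPart", split($"val", raw"\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
|     12|
+-------+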
