Why does SparkSQL require two literal escape backslashes in the SQL query?

Problem Description

When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession = SparkSession.builder.master("local").getOrCreate()

// Use SparkSQL to split a string
val query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
println("The query is: " + query)
val dataframe = sparkSession.sql(query)

// Show the result
dataframe.show(1, false)

which gives the expected output:

+---------------------------------+
|result                           |
+---------------------------------+
|[What is this,  A string I think]|
+---------------------------------+

But I am confused about the need to escape the literal question mark with not a single but a double backslash (represented here as four backslashes, since we must of course escape backslashes in Scala when not using triple-quoting).
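
To illustrate, here is a minimal sketch (reusing the sparkSession from above) with a triple-quoted Scala string, where no Scala-level escaping takes place, so the two backslashes that end up in the SQL text are written out literally:

// With triple quotes Scala does not interpret the backslashes, so the SQL
// text itself visibly contains the two backslashes that seem to be required
val tripleQuotedQuery =
  """SELECT split('What is this? A string I think', '\\?') AS result"""
sparkSession.sql(tripleQuotedQuery).show(1, false)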

I confirmed that some very similar code written by a colleague of mine for Spark 1.5 works just fine using a single (literal) backslash. But if I only use a single literal backslash in Spark 2.1, I get the error from the JVM's regex engine, "Dangling meta character '?' near index 0". I am aware this means the question mark was not escaped properly, but it smells like the backslash itself has to be escaped, first for Scala and then for SQL.
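
For comparison, a sketch of the single-backslash variant that fails for me on Spark 2.1 (same session as above):

// Only one literal backslash ends up in the SQL text here; on Spark 2.1 this
// is the variant that raises "Dangling meta character '?' near index 0"
val singleBackslashQuery =
  "SELECT split('What is this? A string I think', '\\?') AS result"
sparkSession.sql(singleBackslashQuery).show(1, false)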

I'm guessing that this can be useful for inserting control characters (like newline) into the SQL query itself. I'm just confused about whether this changed somewhere between Spark 1.5 and 2.1.

I have googled quite a bit for this, but didn't find anything. Either something has changed, or my colleague's code works in an unintended way.

I also tried this with Python/pyspark, and the same condition applies - double backslashes are needed in the SQL.

Can someone explain this?

I'm running on a relatively simple setup on Windows, with Spark 2.1.0, JDK 1.8.0_111, and the Hadoop winutils.exe.

Recommended Answer

Maybe it is because the backslash is a special symbol, used to concatenate multi-line SQL statements.

sql_1 = spark.sql("SELECT \
    1 AS `col1`, '{0}' AS `col2`".format(var_1))
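
(In this sketch, spark is assumed to be an existing SparkSession and var_1 an existing string variable; the trailing backslash continues the Python string literal onto the next line so the SQL can span multiple source lines.)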
