Why does SparkSQL require two literal escape backslashes in the SQL query?

Problem description

When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession = SparkSession.builder.master("local").getOrCreate()

// Use SparkSQL to split a string
val query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
println("The query is: " + query)
val dataframe = sparkSession.sql(query)

// Show the result
dataframe.show(1, false)

which gives the expected output

+---------------------------------+
|result                           |
+---------------------------------+
|[What is this,  A string I think]|
+---------------------------------+

But I am confused about the need to escape the literal question mark with not a single but a double backslash (written here as four backslashes, since backslashes must of course themselves be escaped in Scala when not using triple quotes).
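
For comparison, here is a minimal sketch of the same query written with Scala triple quotes (queryTripleQuoted is my own variable name, and it assumes the sparkSession from the snippet above); with triple quotes the source contains exactly the two backslashes that SparkSQL itself receives:

// A sketch: triple quotes mean Scala does no unescaping of its own,
// so the two backslashes below are exactly what the SQL parser sees.
val queryTripleQuoted = """SELECT split('What is this? A string I think', '\\?') AS result"""
sparkSession.sql(queryTripleQuoted).show(1, false)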

I confirmed that some very similar code written by a colleague of mine for Spark 1.5 works just fine with a single (literal) backslash. But if I use only a single literal backslash in Spark 2.1, I get an error from the JVM's regex engine: "Dangling meta character '?' near index 0". I am aware this means the question mark was not escaped properly, but it smells like the backslash itself has to be escaped first for Scala and then for SQL.
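
As a sanity check outside Spark (my own snippet run in a plain Scala REPL, not taken from the original post), the JVM regex engine itself only needs a single regex-level backslash, written as two backslashes in ordinary Scala source:

// Plain java.util.regex via String.split, no Spark involved.
"What is this? A string I think".split("\\?")   // Array(What is this, " A string I think")
// "What is this? A string I think".split("?")  // PatternSyntaxException: Dangling meta character '?' near index 0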

I'm guessing that this can be useful for inserting control characters (like a newline) into the SQL query itself. I'm just confused about whether this has changed somewhere between Spark 1.5 and 2.1.
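
For example (a hypothetical snippet of mine, assuming default parser settings, not something from the original post), the same extra unescaping layer lets a SQL string literal carry a control character such as a newline:

// The SQL parser turns the two characters \n in the query text into an
// actual newline inside the string literal.
sparkSession.sql("SELECT 'line one\\nline two' AS s").show(false)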

I have googled quite a bit for this, but didn't find anything. Either something has changed, or my colleague's code works in an unintended way.

I also tried this with Python/pyspark, and the same condition applies - double backslashes are needed in the SQL.

Can anyone explain this?

I'm running a relatively simple setup on Windows, with Spark 2.1.0, JDK 1.8.0_111, and the Hadoop winutils.exe.

Recommended answer

Maybe it is because the backslash is a special symbol, used to concatenate multi-line SQL statements.

# `spark` is an existing SparkSession, `var_1` a value spliced into the query;
# the trailing backslash continues the Python string literal onto the next line.
sql_1 = spark.sql("SELECT \
    1 AS `col1`, '{0}' AS `col2`".format(var_1))
