Selecting a range of elements in an array in Spark SQL
Question
I am using spark-shell to perform the operations below.
I recently loaded a table with an array column in spark-sql. Here is the DDL for it:
create table test_emp_arr(
dept_id string,
dept_nm string,
emp_details Array<string>
)
The data looks like this:
+-------+-------+-------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------------------+
| 10|Finance|[Jon, Snow, Castle, Black, Ned]|
| 20| IT| [Ned, is, no, more]|
+-------+-------+-------------------------------+
I can query the emp_details column like this:
sqlContext.sql("select emp_details[0] from test_emp_arr").show
Problem
I want to query a range of elements in the collection:
Expected working query:
sqlContext.sql("select emp_details[0-2] from emp_details").show
or
sqlContext.sql("select emp_details[0:2] from emp_details").show
Expected output:
+-------------------+
| emp_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon","Snow","Castle","Black")
I can use
emp_details.slice(0, 3)
which gives me back
Array(Jon, Snow, Castle)
I am not able to apply this array operation in spark-sql.
Thanks.
Answer
Here is a solution using a user-defined function (UDF), which has the advantage of working for any slice size you want. It simply builds a UDF around Scala's built-in slice method:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val slice = udf((array: Seq[String], from: Int, to: Int) => array.slice(from, to))
With a sample of your data:
val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show
which produces the expected output:
+--------------------+-------------------+
| emp_details| slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+
You can also register the UDF in your sqlContext and use it like this:
sqlContext.udf.register("slice", (array: Seq[String], from: Int, to: Int) => array.slice(from, to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'), slice(array('Jon','Snow','Castle','Black','Ned'), 0, 3)")
With this solution you won't need lit anymore.
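As a side note beyond the original answer: if you are on a newer Spark version (2.4 or later), Spark SQL ships a built-in slice(array, start, length) function, so no UDF is needed there. A sketch against the table from the question (note the start index is 1-based, unlike Scala's slice):

```sql
-- Spark 2.4+ only: built-in slice(array, start, length), 1-based start.
-- test_emp_arr / emp_details are the table and column from the question above.
SELECT dept_id, slice(emp_details, 1, 3) AS emp_slice
FROM test_emp_arr
```

The same function is also exposed on the DataFrame API as org.apache.spark.sql.functions.slice.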