Selecting a range of elements in an array in Spark SQL
Problem Description
I recently loaded a table with an array column in spark-sql, and I am using spark-shell to perform the operations below.
Here is the DDL for the table:
create table test_emp_arr(
  dept_id string,
  dept_nm string,
  emp_details array<string>
)
The data looks something like this:
+-------+-------+-------------------------------+
|dept_id|dept_nm|                    emp_details|
+-------+-------+-------------------------------+
|     10|Finance|[Jon, Snow, Castle, Black, Ned]|
|     20|     IT|            [Ned, is, no, more]|
+-------+-------+-------------------------------+
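If you want to follow along in spark-shell, here is a minimal, hypothetical setup sketch that builds an equivalent DataFrame (the val name empData and the temp-table registration are assumptions; it is registered under the name emp_details because that is the table name used in the queries below):
// Sketch only: recreate the sample data as a DataFrame and register it as a
// temporary table, assuming a Spark 1.x-style sqlContext as used in this post.
import sqlContext.implicits._

val empData = Seq(
  ("10", "Finance", Seq("Jon", "Snow", "Castle", "Black", "Ned")),
  ("20", "IT", Seq("Ned", "is", "no", "more"))
).toDF("dept_id", "dept_nm", "emp_details")

empData.registerTempTable("emp_details")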
I can query the emp_details column like this:
sqlContext.sql("select emp_details[0] from emp_details").show
Problem
I want to query a range of elements in the collection. These are the queries I expected to work:
sqlContext.sql("select emp_details[0-2] from emp_details").show
or
sqlContext.sql("select emp_details[0:2] from emp_details").show
Expected output:
+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon", "Snow", "Castle", "Black")
I can get the elements in the range 0 to 2 using
emp_details.slice(0, 3)
which returns
Array(Jon, Snow, Castle)
I am not able to apply this array operation in spark-sql. Any help?
Thanks
Solution
Here is a solution using a user-defined function (UDF), which has the advantage of working for any slice size you want. It simply builds a UDF around the Scala built-in slice method:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val slice = udf((array: Seq[String], from: Int, to: Int) => array.slice(from, to))
Example with a sample of your data:
val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show
This produces the expected output:
+--------------------+-------------------+
|         emp_details|              slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+
You can also register the UDF in your sqlContext and use it like this:
sqlContext.udf.register("slice", (array: Seq[String], from: Int, to: Int) => array.slice(from, to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'), slice(array('Jon','Snow','Castle','Black','Ned'), 0, 3)")
You won't need lit anymore with this solution.
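For example, against the question's own table (assuming it is registered as emp_details, the name used in the queries above), the registered UDF can be called directly from SQL with plain integer literals; a sketch:
// Sketch only: the table name emp_details is taken from the question's queries.
sqlContext.sql("select emp_details, slice(emp_details, 0, 3) as sliced from emp_details").show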