Selecting a range of elements in an array in Spark SQL


Question

I use spark-shell to do the operations below.

I recently loaded a table with an array column in spark-sql.

Here is the DDL for it:

create table test_emp_arr(
    dept_id string,
    dept_nm string,
    emp_details Array<string>
)
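
For reference, here is a minimal, hypothetical sketch of loading the two sample rows shown below, assuming the table was created through a Hive-enabled sqlContext (the built-in array() function builds the Array<string> column):

// hypothetical: populate the table with the sample rows below
sqlContext.sql("insert into table test_emp_arr select 10, 'Finance', array('Jon','Snow','Castle','Black','Ned')")
sqlContext.sql("insert into table test_emp_arr select 20, 'IT', array('Ned','is','no','more')")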

The data looks something like this:

+-------+-------+-------------------------------+
|dept_id|dept_nm|                    emp_details|
+-------+-------+-------------------------------+
|     10|Finance|[Jon, Snow, Castle, Black, Ned]|
|     20|     IT|            [Ned, is, no, more]|
+-------+-------+-------------------------------+

I can query the emp_details column like this:

sqlContext.sql("select emp_details[0] from emp_details").show

Problem

I want to query a range of elements in the collection:

Queries I expected to work:

sqlContext.sql("select emp_details[0-2] from emp_details").show

or

sqlContext.sql("select emp_details[0:2] from emp_details").show

Expected output

+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+

In pure Scala, if I have an array such as:

val emp_details = Array("Jon","Snow","Castle","Black")

I can get the elements in the range 0 to 2 using

emp_details.slice(0,3)

which returns

Array(Jon, Snow, Castle)

I am not able to apply this array operation in spark-sql. Any help?

Thanks

Solution

Here is a solution using a user-defined function (UDF), which has the advantage of working for any slice size you want. It simply builds a UDF around Scala's built-in slice method:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val slice = udf((array: Seq[String], from: Int, to: Int) => array.slice(from, to))
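
Note that Seq.slice is zero-based and end-exclusive, so from = 0 and to = 3 keeps elements 0, 1 and 2, which matches the expected output above.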

Example with a sample of your data:

val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show

This produces the expected output:

+--------------------+-------------------+
|         emp_details|              slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+

You can also register the UDF in your sqlContext and use it like this:

sqlContext.udf.register("slice", (array: Seq[String], from: Int, to: Int) => array.slice(from, to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'), slice(array('Jon','Snow','Castle','Black','Ned'), 0, 3)")

You won't need lit anymore with this solution.
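
As a side note beyond the original answer: Spark 2.4 and later ship a built-in slice function, both in org.apache.spark.sql.functions and in SQL, so no UDF is needed there. Unlike Scala's slice, it takes a one-based start index and a length. A minimal sketch, assuming a Spark 2.4+ session exposed as spark and the df defined above:

// built-in since Spark 2.4; renamed on import to avoid clashing with the UDF above
import org.apache.spark.sql.functions.{col, slice => sliceArr}

// slice(column, start, length), start is 1-based
df.withColumn("slice", sliceArr(col("emp_details"), 1, 3)).show

// or directly in SQL
spark.sql("select slice(array('Jon','Snow','Castle','Black','Ned'), 1, 3)").show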

