Compute string length in Spark SQL DSL


Problem description

This is an old question concerning Spark 1.2.

I've been trying to compute, on the fly, the length of a string column in a SchemaRDD for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or about knowing their limitations.

My first attempt was to use the integrated relational queries, for instance:

notes.select('note).orderBy(length('note))

with no luck at compile time:

error: not found: value length

(Which makes me wonder where to find out what "Expression"s this DSL can actually resolve. For instance, it resolves "+" for column additions.)
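(For reference, here is a minimal sketch of what the 1.2 language-integrated DSL does resolve; the people SchemaRDD and its columns are hypothetical, in the style of the programming guide:)

// Hypothetical SchemaRDD 'people', in the style of the Spark 1.2 guide
people.select('name, 'age + 1)          // '+' resolves to an Add expression
people.where('age >= 13).select('name)  // comparison operators resolve too
// ...but nothing named `length` is in scope, hence "not found: value length"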

Then I tried

sql("SELECT note, length(note) as len FROM notes")

which fails with

java.util.NoSuchElementException: key not found: length

(Then I reread this (I'm running 1.2.0): http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features and wondered in what sense Spark SQL supports the listed Hive features.)

Questions: is the length operator really supported in Expressions and/or in SQL statements? If yes, what is the syntax? (Bonus: is there specific documentation about what is resolved in Spark SQL Expressions, and about the syntax in general?)

Thanks!

Recommended answer

Try this in the Spark shell:

// A case class lets Spark SQL infer a schema for the rows
case class Note(id: Int, text: String)
val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
val notesRdd = sc.parallelize(notes)

// HiveContext (not the plain SQLContext) resolves length() via Hive's UDFs
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
import hc.createSchemaRDD  // implicit RDD[Note] -> SchemaRDD conversion

notesRdd.registerTempTable("note")
hc.sql("select id, text, length(text) from note").foreach(println)

It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):

[2,Two,3]
[1,One,3]
[3,Three,5]
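The reason this works is worth spelling out: in 1.2 the plain SQLContext keeps its SQL functions in a small built-in registry with no length entry (hence the java.util.NoSuchElementException: key not found: length), while HiveContext falls back to Hive's built-in UDFs, which do include length. If you'd rather stay on the plain SQLContext, a sketch of a workaround is to register a Scala function yourself (this assumes registerFunction, which the SQLContext of the 1.1/1.2 line exposed, if memory serves; the name strLen is my own):

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Note] -> SchemaRDD conversion

// Register a plain Scala function as a SQL UDF under a name of our choosing
sqlContext.registerFunction("strLen", (s: String) => s.length)

notesRdd.registerTempTable("notes")
sqlContext.sql("SELECT text, strLen(text) AS len FROM notes").foreach(println)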

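A final note for readers on newer versions: the DataFrame API that replaced SchemaRDD in Spark 1.3 eventually gained a built-in length function in org.apache.spark.sql.functions (around Spark 1.5, if memory serves), so the original orderBy attempt becomes direct. A sketch, assuming a DataFrame notes with a string column note:

import org.apache.spark.sql.functions.{col, length}

// What the question tried to express in the 1.2 DSL
notes.select(col("note")).orderBy(length(col("note"))).show()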