Spark extracting values from a Row
I have the following dataframe
val transactions_with_counts = sqlContext.sql(
"""SELECT user_id AS user_id, category_id AS category_id,
COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")
I'm trying to convert the rows to Rating objects, but this fails since x(0) returns Any:
val ratings = transactions_with_counts
.map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))
error: value toInt is not a member of Any
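A quick way to see why this fails: Row.apply returns Any, so x(0) has no static numeric type and .toInt is not available. A minimal standalone sketch (a hand-built Row mirroring the query's (Int, Int, Long) shape; no SparkSession needed) shows a cast-based workaround, though the typed accessors covered in the answer are the cleaner fix:

```scala
import org.apache.spark.sql.Row

// Hand-built Row with the same (Int, Int, Long) shape as the query result.
val r = Row(1, 2, 3L)

// r(0) is statically typed Any, so r(0).toInt does not compile.
// An explicit cast recovers the value; getInt/getLong/getAs are safer.
val userId = r(0).asInstanceOf[Int]
val cnt    = r(2).asInstanceOf[Long]
```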
Let's start with some dummy data:
val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")
val transactions_with_counts = transactions
.groupBy($"user_id", $"category_id")
.count
transactions_with_counts.printSchema
// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
There are a few ways to access Row values and keep the expected types:
Pattern matching:

import org.apache.spark.sql.Row

transactions_with_counts.map {
  case Row(user_id: Int, category_id: Int, rating: Long) =>
    Rating(user_id, category_id, rating)
}
Typed get* methods like getInt, getLong:

transactions_with_counts.map(
  r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
)
getAs method, which can use both names and indices:

transactions_with_counts.map(r => Rating(
  r.getAs[Int]("user_id"),
  r.getAs[Int]("category_id"),
  r.getAs[Long](2)
))
It can be used to properly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.

Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):

transactions_with_counts.as[(Int, Int, Long)]
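Putting it together, here is a minimal end-to-end sketch of the Dataset route. Assumptions not in the original: a Spark 2.x local SparkSession, and that Rating is the MLlib ALS one (org.apache.spark.mllib.recommendation.Rating), whose third field is Double, so the Long count is converted explicitly:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.recommendation.Rating

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("row-extraction")
  .getOrCreate()
import spark.implicits._

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")
val counts = transactions.groupBy($"user_id", $"category_id").count()

// as[(Int, Int, Long)] yields a statically typed Dataset, so the map
// operates on plain tuples instead of untyped Rows.
val ratings = counts.as[(Int, Int, Long)]
  .map { case (user, category, n) => Rating(user, category, n.toDouble) }

val collected = ratings.collect()
spark.stop()
```

With the dummy data above, each (user_id, category_id) pair occurs once, so three Rating records come back.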