Spark extracting values from a Row


Problem description


I have the following dataframe

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")

I'm trying to convert the rows to Rating objects, but since x(0) returns an array, this fails:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))

error: value toInt is not a member of Any

Solution

Let's start with some dummy data:

import sqlContext.implicits._ // for toDF and the $ column syntax (already in scope in spark-shell)

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
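
Note that Row.apply(i), the x(0) call in the question, returns Any, which is why .toInt is rejected; each approach below recovers the concrete column types instead. The snippets also build Rating objects; assuming Rating here is MLlib's ALS rating case class (an assumption, since the import is not shown in the question), the prerequisite would be roughly:

// Assumption: Rating is MLlib's recommendation rating,
// i.e. case class Rating(user: Int, product: Int, rating: Double).
// A Long count widens to Double automatically when passed as the rating.
import org.apache.spark.mllib.recommendation.Rating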

There are a few ways to access Row values and keep expected types:

  1. Pattern matching

    import org.apache.spark.sql.Row
    
    transactions_with_counts.map{
      case Row(user_id: Int, category_id: Int, rating: Long) =>
        Rating(user_id, category_id, rating)
    } 
    

  2. Typed get* methods like getInt, getLong:

    transactions_with_counts.map(
      r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
    )
    

  3. getAs method, which can use both names and indices:

    transactions_with_counts.map(r => Rating(
      r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
    ))
    

    It can be used to properly extract user-defined types, including mllib.linalg.Vector (see the sketch after this list). Obviously, accessing by name requires a schema.

  4. Converting to statically typed Dataset (Spark 1.6+ / 2.0+):

    transactions_with_counts.as[(Int, Int, Long)]
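
    A possible follow-up, sketched here rather than taken from the original answer: with the tuple Dataset in hand, Rating objects can be built without touching Row at all, assuming the implicits import is in scope so an Encoder for the Rating case class can be derived:

    // Sketch only: the field names in the tuple pattern are illustrative.
    val ratings = transactions_with_counts
      .as[(Int, Int, Long)]
      .map { case (userId, categoryId, count) =>
        Rating(userId, categoryId, count.toDouble)
      }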
    
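Returning to the getAs note in point 3: a minimal sketch of pulling a user-defined type out of a Row, assuming Spark 1.x-style MLlib vectors and illustrative column names:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical DataFrame with a vector column; the names are illustrative.
val withVectors = Seq(
  (1, Vectors.dense(1.0, 2.0)),
  (2, Vectors.dense(3.0, 4.0))
).toDF("id", "features")

// getAs preserves the user-defined type, so the value comes back as a Vector, not Any.
val firstFeatures: Vector = withVectors.first().getAs[Vector]("features")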
