Spark extracting values from a Row


Problem description


I have the following dataframe

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")

I'm trying to convert the rows to Rating objects, but this fails because x(0) returns Any:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))

error: value toInt is not a member of Any
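The same compile error can be reproduced outside Spark: Row.apply (the `x(0)` call), like indexing into any heterogeneous collection, is statically typed as returning `Any`. A minimal plain-Scala sketch, no Spark required:

```scala
// A heterogeneous sequence stands in for a Row: every element is statically Any.
val x: Seq[Any] = Seq(1, 2, 5L)

// x(0).toInt  // does not compile: value toInt is not a member of Any

// An explicit cast recovers the type, but it is unchecked and throws
// ClassCastException at runtime if the actual element type differs.
val userId: Int = x(0).asInstanceOf[Int]
```

The solutions below avoid such blind casts by checking or declaring the expected types.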

Solution

Let's start with some dummy data:

val transactions = sc.parallelize(Seq(
  (1, 2), (1, 4), (2, 3))).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
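For intuition, the aggregation above can be sketched with plain Scala collections (no Spark; the data matches the dummy rows used above):

```scala
// Count occurrences of each (user_id, category_id) pair, mimicking
// groupBy($"user_id", $"category_id").count on a local collection.
val transactions = Seq((1, 2), (1, 4), (2, 3))

val transactionsWithCounts: Seq[(Int, Int, Long)] = transactions
  .groupBy(identity)                                    // group equal pairs
  .map { case ((user, cat), xs) => (user, cat, xs.size.toLong) }
  .toSeq
```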

There are a few ways to access Row values and keep expected types:

  1. Pattern matching

    import org.apache.spark.sql.Row
    
    transactions_with_counts.map{
      case Row(user_id: Int, category_id: Int, rating: Long) =>
        Rating(user_id, category_id, rating)
    } 
    

  2. Typed get* methods like getInt, getLong:

    transactions_with_counts.map(
      r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
    )
    

  3. getAs method which can use both names and indices:

    transactions_with_counts.map(r => Rating(
      r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
    ))
    

    It can be used to properly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.
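The type-recovery idea behind option 1 can be demonstrated without Spark by pattern matching on a `Seq[Any]` stand-in for a Row (the Rating case class below is illustrative, modeled on mllib's `Rating(user, product, rating)`):

```scala
// Illustrative stand-ins: a Rating case class and a Row-like Seq[Any].
case class Rating(user: Int, product: Int, rating: Double)

val row: Seq[Any] = Seq(1, 2, 5L) // user_id: Int, category_id: Int, count: Long

// Typed patterns check the runtime class of each element and bind it with
// the expected static type, mirroring case Row(user_id: Int, ...).
val rating = row match {
  case Seq(user: Int, category: Int, count: Long) =>
    Rating(user, category, count.toDouble)
}
```

If an element has an unexpected runtime type, the match simply fails (here with a MatchError), rather than producing a silently wrong cast.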
