Extracting `Seq[(String,String,String)]` from spark DataFrame


Question

I have a spark DF with rows of Seq[(String, String, String)]. I'm trying to do some kind of a flatMap with that but anything I do try ends up throwing

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple3

I can take a single row or multiple rows from the DF just fine

df.map{ r => r.getSeq[Feature](1)}.first

returns

Seq[(String, String, String)] = WrappedArray([ancient,jj,o], [olympia_greece,nn,location] .....

and the data type of the RDD seems correct.

org.apache.spark.rdd.RDD[Seq[(String, String, String)]]

The schema of the df is

root
 |-- article_id: long (nullable = true)
 |-- content_processed: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- lemma: string (nullable = true)
 |    |    |-- pos_tag: string (nullable = true)
 |    |    |-- ne_tag: string (nullable = true)

I know this problem is related to spark sql treating the RDD rows as org.apache.spark.sql.Row even though they idiotically say that it's a Seq[(String, String, String)]. There's a related question (link below) but the answer to that question doesn't work for me. I am also not familiar enough with spark to figure out how to turn it into a working solution.

Are the rows Row[Seq[(String, String, String)]] or Row[(String, String, String)] or Seq[Row[(String, String, String)]] or something even crazier.

I'm trying to do something like

df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1)

which appears to work but doesn't actually

df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1).first

throws the above error. So how am I supposed to (for instance) get the first element of the second tuple on each row?
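(Editor's note: the "appears to work but doesn't" behaviour comes from type erasure. `getSeq[Feature]` performs an unchecked generic cast, so nothing fails at the cast site; the `ClassCastException` only surfaces when a tuple accessor like `._1` is finally invoked on an element. A minimal plain-Scala sketch of the same effect, no Spark required:)

```scala
// Plain-Scala sketch of the deferred ClassCastException: a cast whose
// element type is erased is unchecked, so it "succeeds", and the error
// only appears when a tuple method is actually called on an element.
val rows: Seq[Any] = Seq("not-a-tuple") // stand-in for GenericRowWithSchema

// Compiles and "works": erasure means the element type is never checked here.
val typed = rows.asInstanceOf[Seq[(String, String, String)]]

// The cast only blows up on first real use of an element as a tuple.
val deferred =
  try { typed.head._1; false }
  catch { case _: ClassCastException => true }

println(deferred) // prints "true": the exception was deferred to the access site
```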

Also WHY has spark been designed to do this, it seems idiotic to claim that something is of one type when in fact it isn't and can not be converted to the claimed type.

Related question: Convert ArrayBuffer to HashSet in DataFrame to RDD from Hive table

Related bug report:

Answer

Well, it doesn't claim that it is a tuple. It claims it is a struct which maps to Row:

import org.apache.spark.sql.Row
import spark.implicits._  // needed for .toDF and $"..."; predefined in spark-shell

case class Feature(lemma: String, pos_tag: String, ne_tag: String)
case class Record(id: Long, content_processed: Seq[Feature])

val df = Seq(
  Record(1L, Seq(
    Feature("ancient", "jj", "o"),
    Feature("olympia_greece", "nn", "location")
  ))
).toDF

val content = df.select($"content_processed").rdd.map(_.getSeq[Row](0))

You'll find the exact mapping rules in the Spark SQL programming guide.

Since Row is not exactly a pretty structure to work with, you'll probably want to map it to something useful:

content.map(_.map {
  case Row(lemma: String, pos_tag: String, ne_tag: String) => 
    (lemma, pos_tag, ne_tag)
})

或:

content.map(_.map ( row => (
  row.getAs[String]("lemma"),
  row.getAs[String]("pos_tag"),
  row.getAs[String]("ne_tag")
)))

Finally a slightly more concise approach with Datasets:

df.as[Record].rdd.map(_.content_processed)

df.select($"content_processed").as[Seq[(String, String, String)]]

although this seems to be slightly buggy at this moment.

There is an important difference between the first approach (Row.getAs) and the second one (Dataset.as). The former extracts objects as Any and applies asInstanceOf. The latter uses encoders to transform between internal types and the desired representation.
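(Editor's note: a small plain-Scala illustration of the `getAs` side of that difference. `getAs` is essentially `get(i).asInstanceOf[T]`, so a cast to a concrete non-generic class is checked immediately at the cast site, while a cast involving an erased element type is not checked at all. This sketch uses stand-in values, not Spark's actual internals:)

```scala
// A cast to a concrete class is checked right where it happens...
val field: Any = "ancient"                            // stand-in for a Row field
val asString = field.asInstanceOf[String]             // checked, succeeds
val eager =
  try { field.asInstanceOf[java.lang.Integer]; false }
  catch { case _: ClassCastException => true }        // fails immediately

// ...while a cast whose element type is erased "succeeds" silently,
// deferring any failure to the point of use.
val unchecked = Seq(field).asInstanceOf[Seq[Int]]

println((asString, eager)) // prints "(ancient,true)"
```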
