Extracting `Seq[(String,String,String)]` from spark DataFrame
Problem description
I have a spark DF with rows of Seq[(String, String, String)]. I'm trying to do some kind of a flatMap with that but anything I do try ends up throwing
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple3
I can take a single row or multiple rows from the DF just fine
df.map{ r => r.getSeq[Feature](1)}.first
returns
Seq[(String, String, String)] = WrappedArray([ancient,jj,o], [olympia_greece,nn,location] .....
and the data type of the RDD seems correct.
org.apache.spark.rdd.RDD[Seq[(String, String, String)]]
The schema of the df is
root
|-- article_id: long (nullable = true)
|-- content_processed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- lemma: string (nullable = true)
| | |-- pos_tag: string (nullable = true)
| | |-- ne_tag: string (nullable = true)
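For reference, a minimal sketch of how a DF with this shape could be reproduced in a spark-shell (the Token case class, the article_id value, and the sample tokens are hypothetical placeholders, not the original data):

```scala
// Assumes a spark-shell with the SQL implicits in scope (e.g. import sqlContext.implicits._).
case class Token(lemma: String, pos_tag: String, ne_tag: String)

val df = Seq(
  (1L, Seq(Token("ancient", "jj", "o"), Token("olympia_greece", "nn", "location")))
).toDF("article_id", "content_processed")

df.printSchema()   // prints the array<struct<lemma,pos_tag,ne_tag>> layout shown above
```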
I know this problem is related to spark sql treating the RDD rows as org.apache.spark.sql.Row even though they idiotically say that it's a Seq[(String, String, String)]. There's a related question (link below) but the answer to that question doesn't work for me. I am also not familiar enough with spark to figure out how to turn it into a working solution.
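A quick runtime check of what the elements actually are might look like the sketch below (getSeq[Any] is used only to sidestep the failing cast):

```scala
// Sketch: inspect the runtime class of the first struct element in the array column.
// On a schema like the one above this typically reports GenericRowWithSchema,
// i.e. the elements are sql Rows, not scala.Tuple3.
df.rdd.map(r => r.getSeq[Any](1).head.getClass.getName).first
```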
Are the rows Row[Seq[(String, String, String)]] or Row[(String, String, String)] or Seq[Row[(String, String, String)]] or something even crazier?
I'm trying to do something like
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1)
which appears to work but doesn't actually
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1).first
throws the above error. So how am I supposed to (for instance) get the first element of the second tuple on each row?
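A sketch of one way this might be expressed, treating the struct elements as Rows and reading fields by position (the index 0 assumes the lemma/pos_tag/ne_tag order from the schema above):

```scala
import org.apache.spark.sql.Row

// Sketch: read the array<struct> column as Seq[Row] and pull the first field
// ("lemma") out of the second element, instead of casting the elements to Tuple3.
df.rdd
  .map(r => r.getSeq[Row](1))             // Seq[Row], one Row per struct element
  .map(tokens => tokens(1).getString(0))  // second element's first field
  .first
```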
Also, WHY has spark been designed to do this? It seems idiotic to claim that something is of one type when in fact it isn't and cannot be converted to the claimed type.
Related question: Converting an ArrayBuffer to a HashSet in a DataFrame to RDD from a Hive table