Extract column values of Dataframe as List in Apache Spark
Question
I want to convert a string column of a data frame to a list. What I can find from the Dataframe API is RDD, so I tried converting it back to RDD first, and then applying the toArray function to the RDD. In this case, the length and the SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there is an appropriate way to convert the column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
Answer
This should return a collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably give you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the r => r(0).asInstanceOf[YOUR_TYPE] mapping.
P.S. Due to automatic conversion, you can skip the .rdd part.
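As a sketch of the whole pattern, here is a minimal, self-contained Spark Scala example. The DataFrame, its column name "id", and the local SparkSession are all illustrative assumptions, not from the original question; running it requires a Spark dependency on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: a local SparkSession and a one-column DataFrame.
val spark = SparkSession.builder()
  .appName("collect-column-as-list")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val dataFrame = Seq("A00001", "A00002", "A00003").toDF("id")

// Without the mapping: an Array[Row]; each Row prints with brackets, e.g. [A00001].
val rows = dataFrame.select("id").collect()

// With the mapping: an Array[Any] holding the raw column values.
val anyValues = dataFrame.select("id").rdd.map(r => r(0)).collect()

// With asInstanceOf: an Array[String], as the answer suggests.
val strings = dataFrame.select("id").rdd
  .map(r => r(0).asInstanceOf[String])
  .collect()

// Alternative: a typed Dataset via an implicit Encoder avoids the cast entirely.
val typedStrings = dataFrame.select("id").as[String].collect()
```

The `.as[String]` variant is often preferred in modern Spark code because the Encoder checks the column type at analysis time instead of failing with a ClassCastException at run time.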