How to convert a dataframe column to sequence
Question

I have a dataframe as follows:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as a sequence so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
As a result I want to apply the sample code given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried converting the df to an RDD and mapping it, but then I could not manage to convert it back to a df.
Thanks in advance.
Edit:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.collect_list

val conf = new SparkConf()  // conf was undefined in the original snippet
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
import sqlContext.implicits._  // needed for the $"..." column syntax

// Load the source table over JDBC (connection URL truncated in the original)
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "table"))
df.show(20)

// Group all terms for each label into a single sequence column
df.groupBy($"label").agg(collect_list($"term").alias("term"))
Answer
You can use the collect_list or collect_set function:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
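As a minimal, self-contained sketch of the aggregation and the follow-on Word2vec step (the local session, sample rows, and parameter values here are illustrative assumptions, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list
import org.apache.spark.ml.feature.Word2Vec

// Local session for illustration
val spark = SparkSession.builder()
  .appName("collect-list-sketch")  // placeholder name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows mirroring the question's LABEL/TERM columns
val df = Seq(
  (4, "inhibitori_effect"),
  (4, "novel_therapeut"),
  (4, "cell_line")
).toDF("label", "term")

// One row per label; term becomes an array of all terms for that label
// (ordering within the array is not guaranteed)
val grouped = df.groupBy($"label").agg(collect_list($"term").alias("term"))

// The array<string> column can be fed straight to Word2Vec,
// as in the linked ml-features example
val word2Vec = new Word2Vec()
  .setInputCol("term")
  .setOutputCol("result")
  .setVectorSize(3)  // illustrative value
  .setMinCount(0)
val model = word2Vec.fit(grouped)
```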
In Spark < 2.0 it requires a HiveContext, and in Spark 2.0+ you have to enable Hive support on the SparkSession builder. See Use collect_list and collect_set in Spark SQL.
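For the Spark 2.0+ case, enabling Hive support on the builder looks roughly like this (the app name is a placeholder, and Hive classes must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TermSequences")  // placeholder
  .enableHiveSupport()       // replaces the old HiveContext
  .getOrCreate()
```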