Fetching distinct values on a column using Spark DataFrame
Question
Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the entire result set back to the driver program. Currently I am performing this task as below; is there a better approach?
import sqlContext.implicits._

preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
  val applicationId = x.getAs[String](ApplicationId)
  val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
  // DO SOME TASK PER applicationId
})

preProcessedData.unpersist()
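The access pattern above (collect the distinct keys, then filter the full dataset once per key) can be sketched with plain Scala collections to make it concrete. The sample data below is hypothetical, standing in for `preProcessedData` as `(applicationId, payload)` pairs:

```scala
// Hypothetical stand-in for preProcessedData.
val preProcessed = Seq(("app-1", 10), ("app-2", 20), ("app-1", 30))

// Equivalent of select(ApplicationId).distinct.collect()
val distinctIds = preProcessed.map(_._1).distinct

// Equivalent of the per-id filter inside the foreach:
// each distinct id triggers one full pass over the data.
val perId = distinctIds.map { id =>
  id -> preProcessed.filter(_._1 == id).map(_._2)
}.toMap
// perId("app-1") == Seq(10, 30)
```

This makes the cost visible: one scan of the whole dataset per distinct key, which is why the question asks whether there is a better approach.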
Answer
Well, to obtain all the different values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.
For example:
import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the different values. If you show this DataFrame you should see only {1, 3}.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. This one is a simple function, but they can get complicated.
// Note the explicit parameter type: udf needs it to infer the column type.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame.
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
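As a quick sanity check of what distinct followed by the UDF produces, the same logic can be run on plain Scala collections, no Spark required. Note that `value / 10` is integer division, so both distinct ages (1 and 3) map to 0:

```scala
// Same data as the DataFrame example, as plain (age, salary) pairs.
val rows = Array((1, 2), (3, 4), (1, 6))

// distinct on the "age" column
val distinctAges = rows.map(_._1).distinct // Array(1, 3)

// The UDF body applied to each distinct value (integer division).
val transformed = distinctAges.map(_ / 10) // Array(0, 0)
```

If a different result is expected per age, adjust the function body accordingly, e.g. use `value / 10.0` for floating-point division.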