使用Spark DataFrame在列上获取不同的值 [英] Fetching distinct values on a column using Spark DataFrame
问题描述
使用Spark 1.6.1版本,我需要在列上获取不同的值,然后在其上执行一些特定的转换.该列包含超过5000万条记录,并且可以不断扩大.
我知道执行distinct.collect()
将把调用带回到驱动程序.目前,我正在执行以下任务,是否有更好的方法?
Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect()
will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?
import sqlContext.implicits._
preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
val applicationId = x.getAs[String](ApplicationId)
val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
// DO SOME TASK PER applicationId
})
preProcessedData.unpersist()
推荐答案
Well to obtain all different values in a Dataframe
you can use distinct. As you can see in the documentation that method returns another DataFrame
. After that you can create a UDF
in order to transform each record.
例如:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")
// I obtain all different values. If you show you must see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct
// Define your udf. In this case I defined a simple function, but they can get complicated.
val myTransformationUDF = udf(value => value / 10)
// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
这篇关于使用Spark DataFrame在列上获取不同的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!