How to get a histogram of all columns in a large CSV / RDD[Array[Double]] using Apache Spark Scala?

Views: 275 Published: 2017/2/24 20:14:27 Tags: scala csv apache-spark histogram rdd


Question

I am trying to calculate a histogram of every column of a CSV file using Spark Scala.

I found that DoubleRDDFunctions supports histograms, so I wrote the following to get a histogram of each column:

Get the column count.
Create an RDD[Double] for each column and calculate the histogram of each RDD using DoubleRDDFunctions.

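// Column indices, from 0 up to the number of columns in the first row
val columnIndexArray = (0 until rdd.first().length).toArray

// One histogram per column, each with 6 equal-width buckets
val histogramData = columnIndexArray.map { column =>
  rdd.map(row => row(column)).histogram(6)
}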

Is this a good approach? Can anyone suggest a better way to tackle this?

Thanks in advance.
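The loop in the question scans the data separately for each column. One possible alternative, shown here only as a rough sketch, is to count the buckets of all columns while reading each row once. The code below is plain Scala with no Spark dependency: a small local Seq stands in for the RDD[Array[Double]], and the names bucketIndex, addRow and mergeCounts are hypothetical helpers, not part of Spark. The two functions have the shape that RDD.aggregate expects for its seqOp and combOp arguments, and the per-column minima and maxima are assumed to be known beforehand (in Spark that would take an extra pass).

```scala
// Plain-Scala sketch (no Spark dependency) of equal-width bucket counting
// for every column in one pass. All helper names are hypothetical.

// Map a value to its bucket index. The last bucket is closed on the right,
// so the column maximum lands in bucket nBuckets - 1.
def bucketIndex(v: Double, min: Double, max: Double, nBuckets: Int): Int =
  if (max == min) 0
  else math.min(((v - min) / (max - min) * nBuckets).toInt, nBuckets - 1)

// seqOp shape: fold one row into the per-column count arrays.
def addRow(acc: Array[Array[Long]], row: Array[Double],
           mins: Array[Double], maxs: Array[Double],
           nBuckets: Int): Array[Array[Long]] = {
  var c = 0
  while (c < row.length) {
    acc(c)(bucketIndex(row(c), mins(c), maxs(c), nBuckets)) += 1L
    c += 1
  }
  acc
}

// combOp shape: merge the counts computed on two partitions.
def mergeCounts(a: Array[Array[Long]], b: Array[Array[Long]]): Array[Array[Long]] = {
  for (c <- a.indices; i <- a(c).indices) a(c)(i) += b(c)(i)
  a
}

// Tiny local stand-in for RDD[Array[Double]] (two columns, four rows).
val rows = Seq(
  Array(0.0, 10.0),
  Array(1.0, 40.0),
  Array(1.5, 45.0),
  Array(4.0, 50.0))

val nCols = rows.head.length
val nBuckets = 2

// Assumption: column minima and maxima are computed up front.
val mins = Array.tabulate(nCols)(c => rows.map(r => r(c)).min)
val maxs = Array.tabulate(nCols)(c => rows.map(r => r(c)).max)

val zero = Array.fill(nCols)(Array.fill(nBuckets)(0L))
val counts = rows.foldLeft(zero)((acc, row) => addRow(acc, row, mins, maxs, nBuckets))

// counts(c)(i) is the number of values of column c that fall into bucket i.
counts.zipWithIndex.foreach { case (cs, c) =>
  println(s"column $c: ${cs.mkString(", ")}")
}
```

Whether this is actually faster than the per-column loop depends on the number of columns and on how the data is cached, so it would need to be measured on the real data.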

Recommended answer

Not exactly better, but an alternative is to convert the RDD to a DataFrame and use the Hive histogram_numeric aggregate function (a UDAF).

Example data:

import scala.util.Random
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{callUDF, lit, col}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

Random.setSeed(1)

val ncol = 5

val rdd = sc.parallelize((1 to 1000).map(
  _ => Row.fromSeq(Array.fill(ncol)(Random.nextDouble))
))

val schema = StructType(
  (1 to ncol).map(i => StructField(s"x$i", DoubleType, false)))

val df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("df")

Query:

val nBuckets = 3
val columns = df.columns.map(
  c => callUDF("histogram_numeric", col(c), lit(nBuckets)).alias(c))
val histograms = df.select(columns: _*)

histograms.printSchema

// root
//  |-- x1: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- x: double (nullable = true)
//  |    |    |-- y: double (nullable = true)
//  |-- x2: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- x: double (nullable = true)
//  |    |    |-- y: double (nullable = true)
//  |-- x3: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- x: double (nullable = true)
//  |    |    |-- y: double (nullable = true)
//  |-- x4: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- x: double (nullable = true)
//  |    |    |-- y: double (nullable = true)
//  |-- x5: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- x: double (nullable = true)
//  |    |    |-- y: double (nullable = true)

// col is already imported above, so no sqlContext.implicits._ is needed
histograms.select(col("x1")).collect()

// Array([WrappedArray([0.16874313309969038,334.0],
//   [0.513382068667877,345.0], [0.8421388886903808,321.0])])
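
Each element of the result is an (x, y) pair. According to the Hive documentation for histogram_numeric, x is a bin center and y is the bin height, and the bins are not uniformly spaced. The following plain-Scala sketch (no Spark needed; the names x1Histogram, total and frequencies are made up for illustration) copies the three pairs printed above, checks that the heights add up to the row count of the example data, and converts them to relative frequencies.

```scala
// (x, y) pairs copied from the sample output for column x1 above:
// x is the bin center, y is the bin height (number of values in the bin).
val x1Histogram = Seq(
  (0.16874313309969038, 334.0),
  (0.513382068667877, 345.0),
  (0.8421388886903808, 321.0))

// The heights add up to 1000.0, the number of rows generated above.
val total = x1Histogram.map(_._2).sum

// Relative frequency of each bin.
val frequencies = x1Histogram.map { case (center, height) =>
  (center, height / total)
}

frequencies.foreach { case (center, share) =>
  println(f"center=$center%.3f  share=$share%.3f")
}
```

Because the bins are identified by their centers rather than by boundaries, this output is not directly comparable with the boundary and count arrays returned by the RDD histogram method in the question.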
