How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Question

Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.

The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.

To use the data in the lab, I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. In the training they use a Databricks notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlsx, sheet2.xlsx etc. and put them into a sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

Solution

For this I used the spark-excel package. It can be added to the build.sbt file as: libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
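
For context, a minimal build.sbt might look like the sketch below. The Scala and Spark versions here are assumptions (spark-excel 0.8.2 was published for Scala 2.11 and the Spark 2.x line), so adjust them to your setup:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark core and SQL provide SparkContext, SparkSession and DataFrame.
  "org.apache.spark" %% "spark-core"  % "2.1.0",
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  // Excel data source used by readExcel below.
  "com.crealytics"   %% "spark-excel" % "0.8.2"
)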

The code to execute in the IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read an xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()
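
// Side note: newer spark-excel releases can read a single sheet straight from the
// workbook (e.g. via a "sheetName" option), which would avoid splitting the
// workbook into separate files; check the option name for your version.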

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
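// Note: DataFrame.union matches columns by position, not by name, so this
// relies on every sheet having the same column order (AT, V, AP, RH, PE).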
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows 
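
One caveat: union (like the older unionAll) resolves columns positionally. If the sheets might not share the same column order and you are on Spark 2.3 or later, unionByName is a safer drop-in; a minimal sketch under that assumption:

// Spark 2.3+: match columns by name instead of position.
val ppdfByName = dfs.reduce(_.unionByName(_))
ppdfByName.count()  // expected to equal the count above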
