如何读取多个Excel文件并将它们连接成一个Apache Spark DataFrame? [英] How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?
问题描述
实验室中使用的数据集可从 UCI机器学习资料库。它包含一组来自燃气发电厂的各种传感器的读数。格式为xlsx文件,带有五张。
要使用实验室中的数据,我需要从Excel文件中读取所有表格,并将其连接成一个Spark DataFrame 。在培训期间,他们正在使用Databricks Notebook,但是我正在使用IntelliJ IDEA与Scala并评估控制台中的代码。
第一步是将所有Excel表格保存到单独的xlsx文件名为 sheet1.xlxs
, sheet2.xlsx
等,并将它们放入
目录。
如何阅读所有Excel文件并将其连接成一个Apache Spark DataFrame?
为此,我使用了 spark-excel 软件包。它可以添加到build.sbt文件: libraryDependencies + =com.crealytics%%spark-excel%0.8.2
在IntelliJ IDEA Scala Console中执行的代码是:
import org.apache。 spark。Spark Spark,SparkContext}
import org.apache.spark.sql。{SparkSession,DataFrame}
import java.io.File
val conf = new SparkConf() .setAppName(Excel to DataFrame)。setMaster(local [*])
val sc = new SparkContext(conf)
sc.setLogLevel(WARN)
val spark = SparkSession.builder()。getOrCreate()
//使用spark-excel读取xlsx文件的功能。
//具有尾随点的代码格式可以作为块发送到IJ Scala Console。
def readExcel(file:String):DataFrame = spark.read。
格式(com.crealytics.spark.excel)。
选项(位置,文件)。
选项(useHeader,true)。
选项(treatEmptyValuesAsNulls,true)。
选项(inferSchema,true)。
选项(addColorColumns,False)。
load()
val dir = new File(./ data / CCPP / sheets)
val excelFiles = dir.listFiles.sorted.map(f => f.toString)// Array [String]
val dfs = excelFiles.map(f => readExcel(f))// Array [DataFrame]
val ppdf = dfs.reduce (_.union(_))// DataFrame
ppdf.count()// res3:Long = 47840
ppdf.show(5)
控制台输出:
--- + ----- + ------- + ----- + ------ +
| AT | V | AP | RH | PE |
+ ----- + ----- + ------- + ----- + ------ +
| 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |
+ ----- + ----- + ------- + ----- + ------ +
只显示前5行
Recently I wanted to do Spark Machine Learning Lab from Spark Summit 2016. Training video is here and exported notebook is available here.
The dataset used in the lab can be downloaded from UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is xlsx file with five sheets.
To use the data in the lab I needed to read all the sheets form the Excel file and to concatenate them into one Spark DataFrame. During the training they are using Databricks Notebook but I was using IntelliJ IDEA with Scala and evaluating the code in the console.
The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlxs
, sheet2.xlsx
etc. and put them into sheets
directory.
How to read all the Excel files and concatenate them into one Apache Spark DataFrame?
For this I have used spark-excel package. It can be added to build.sbt file as : libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
The code to execute in IntelliJ IDEA Scala Console was:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File
val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()
// Function to read xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
format("com.crealytics.spark.excel").
option("location", file).
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema", "true").
option("addColorColumns", "False").
load()
val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f)) // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_)) // DataFrame
ppdf.count() // res3: Long = 47840
ppdf.show(5)
Console output:
+-----+-----+-------+-----+------+
| AT| V| AP| RH| PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows
这篇关于如何读取多个Excel文件并将它们连接成一个Apache Spark DataFrame?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!