How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?
Question
I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to other DataFrames later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a DataFrame. But if there is any library or API that can help with this process, it would be easier. Any help is highly appreciated.
Answer
The solution to your problem is to use the Spark Excel dependency in your project.
Spark Excel has flexible options to play with.
I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:
def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")                 // first row contains column names
  .option("treatEmptyValuesAsNulls", "true")   // empty cells become null instead of ""
  .option("inferSchema", "true")               // infer column types from the data
  .option("addColorColumns", "false")          // skip the extra cell-color columns
  .load()
val data = readExcel("path to your excel file")
data.show(false)
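To make the com.crealytics.spark.excel format available, the spark-excel library has to be on the classpath. A minimal sketch of the sbt dependency (the version number here is only an example; pick the release that matches your Spark and Scala versions):

```scala
// build.sbt -- version "0.13.5" is an assumption, check the latest release
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```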
If your Excel file has multiple sheets, you can give the sheet name as an option:
.option("sheetName", "Sheet2")
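Since the question mentions joining the result to other DataFrames later, here is a sketch of reading two sheets and joining them. The sheet names "Sheet1"/"Sheet2" and the join column "id" are assumptions for illustration, not part of the original answer:

```scala
// Read one sheet per call; the "sheetName" values and join key "id" are assumed
def readSheet(file: String, sheet: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("inferSchema", "true")
  .option("sheetName", sheet)
  .load()

val sheet1 = readSheet("path to your excel file", "Sheet1")
val sheet2 = readSheet("path to your excel file", "Sheet2")

// Join the two sheets on an assumed common column "id"
val joined = sheet1.join(sheet2, Seq("id"), "inner")
joined.show(false)
```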
I hope it helps.