How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?
Question
I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to another DataFrame later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a DataFrame. But if there is any library or API that can help with this process, it would be easy. Any help is highly appreciated.
Answer
The solution to your problem is to use the Spark Excel dependency in your project.
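To pull it in, add the spark-excel library to your build. A minimal sbt sketch (the version number here is illustrative; pick the release that matches your Spark and Scala versions):

```scala
// build.sbt -- spark-excel from com.crealytics (version is an example, not a recommendation)
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.9.17"
```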
Spark Excel has flexible options to play with.
I have tested the following code to read from Excel and convert it to a DataFrame, and it just works perfectly:
def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "false")
  .load()
val data = readExcel("path to your excel file")
data.show(false)
If your Excel workbook has multiple sheets, you can give the sheet name as an option:
.option("sheetName", "Sheet2")
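Since the goal is to join the Excel data with another DataFrame later, the loaded result can be used like any other DataFrame. A rough sketch (`readExcel` is the function above; `otherDf` and the `"id"` join column are made-up names for illustration):

```scala
// Read one sheet from the workbook ("Sheet2" is an example sheet name)
val sheet2 = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", "path to your excel file")
  .option("sheetName", "Sheet2")
  .option("useHeader", "true")
  .option("inferSchema", "true")
  .load()

// Join with another DataFrame on an assumed shared key column "id"
val joined = sheet2.join(otherDf, Seq("id"), "inner")
joined.show(false)
```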
Hope it helps.