How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?


Question

I have a large Excel file (xlsx and xls) with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined with other DataFrames later. I was thinking of using Apache POI to save it as CSV and then reading the CSV into a DataFrame. But if there is a library or API that can help with this process, that would be easier. Any help is highly appreciated.

Recommended answer

The solution to your problem is to add the Spark Excel dependency to your project.
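For an sbt build, declaring the dependency might look like the line below. The coordinates are those under which the spark-excel library is published; the version is deliberately left as a placeholder, since the right release depends on your Spark and Scala versions.

```scala
// build.sbt (sketch): replace <version> with a spark-excel release
// that matches your Spark and Scala versions.
libraryDependencies += "com.crealytics" %% "spark-excel" % "<version>"
```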

Spark Excel offers a flexible set of options to work with.

I have tested the following code, which reads from an Excel file and converts it to a DataFrame, and it works well:

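import org.apache.spark.sql.DataFrame

// Assumes a `sqlContext` is already in scope (as in spark-shell)
// and that the spark-excel package is on the classpath.
def readExcel(file: String): DataFrame = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("location", file)                   // path to the workbook
    .option("useHeader", "true")                // first row holds the column names
    .option("treatEmptyValuesAsNulls", "true")  // empty cells become null
    .option("inferSchema", "true")              // infer column types from the data
    .option("addColorColumns", "false")         // no extra columns for cell colors
    .load()

val data = readExcel("path to your excel file")

// Print the first rows without truncating long cell values.
data.show(false)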
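The option names above belong to the spark-excel release the answer was tested with. As far as I know, later releases pass the file path to `load(...)` instead of a `location` option, rename `useHeader` to `header`, and select the sheet through `dataAddress`. A minimal sketch under that assumption follows; the object and method names are hypothetical, and the option names should be checked against the README of the release you install.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExcelReader {
  // Assumption: a newer spark-excel release that accepts the path in load(...)
  // and understands the "header" and "dataAddress" options.
  def readExcelSheet(spark: SparkSession, file: String, sheet: String): DataFrame =
    spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", s"'$sheet'!A1") // start reading at cell A1 of the named sheet
      .option("header", "true")              // first row holds the column names
      .option("inferSchema", "true")         // infer column types from the data
      .load(file)
}
```

In spark-shell this could be called as `ExcelReader.readExcelSheet(spark, "path to your excel file", "Sheet2")`.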

If your Excel file has multiple sheets, you can pass the sheet name as an option:

.option("sheetName", "Sheet2")
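Since the question is about joining several sheets, here is a rough sketch of reading two sheets of one workbook and joining them. It reuses the option names from the answer's code. The sheet names and the `id` join column are hypothetical and stand in for whatever your workbook contains.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object MultiSheetJoin {
  // Same options as the answer's readExcel, plus the sheet to read.
  def readSheet(sqlContext: SQLContext, file: String, sheet: String): DataFrame =
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("location", file)
      .option("sheetName", sheet)
      .option("useHeader", "true")
      .option("inferSchema", "true")
      .load()

  // Hypothetical example: both sheets are assumed to share an "id" column.
  def joinSheets(sqlContext: SQLContext, file: String): DataFrame = {
    val first  = readSheet(sqlContext, file, "Sheet1")
    val second = readSheet(sqlContext, file, "Sheet2")
    first.join(second, Seq("id")) // inner equi-join on the shared column
  }
}
```

Each sheet becomes its own DataFrame, so the result of `joinSheets` can in turn be joined with DataFrames from other sources.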

I hope this helps.
