How to know the file formats supported by Databricks?
Problem Description
I have a requirement to load various files (of different types) into a Spark DataFrame. Are all of these file formats supported by Databricks? If yes, where can I get the list of options supported for each file format?
delimited
csv
parquet
avro
excel
json
Thanks
I don't know exactly what Databricks offers out of the box (pre-installed), but you can do some reverse-engineering using the org.apache.spark.sql.execution.datasources.DataSource object, which is (quoting the scaladoc):
The main class responsible for representing a pluggable Data Source in Spark SQL
Data sources usually register themselves using the DataSourceRegister interface (and use shortName to provide their alias):
Data sources should implement this trait so that they can register an alias to their data source.
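For illustration, here is a minimal sketch of such a registration; MyFormatSource and the "myformat" alias are hypothetical, and a real source would also implement e.g. RelationProvider or FileFormat to do the actual reading:

import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical source registering the alias "myformat"; Spark discovers it through a
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file listing this class.
class MyFormatSource extends DataSourceRegister {
  override def shortName(): String = "myformat"
  // the actual read/write logic comes from additional traits, e.g. RelationProvider
}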
Reading further in the scaladoc of DataSourceRegister, you'll find that:
This allows users to give the data source alias as the format type over the fully qualified class name.
So, YMMV.
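In other words (a sketch: data.csv is a placeholder, and the fully-qualified class name is the one the built-in CSV source uses in Spark 2.x, so it may differ across versions):

scala> val byAlias = spark.read.format("csv").load("data.csv")
scala> val byClassName = spark.read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("data.csv")

Both calls resolve to the same source; the alias is just easier to type and survives internal package renames.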
Unless you find an authoritative answer on Databricks, you may want to (follow DataSource.lookupDataSource and) use Java's ServiceLoader.load method to find all registered implementations of the DataSourceRegister interface.
// start a Spark application with an external module that provides a separate DataSource
$ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT

scala> import java.util.ServiceLoader
scala> import org.apache.spark.sql.sources.DataSourceRegister
scala> import scala.collection.JavaConverters._

scala> val formats = ServiceLoader.load(classOf[DataSourceRegister])

scala> formats.asScala.map(_.shortName).foreach(println)
orc
hive
libsvm
csv
jdbc
json
parquet
text
console
socket
kafka
Where can I get the list of options supported for each file format?
That's not possible, as there is no common API (like the one in Spark MLlib) for defining options. Every format does this on its own, unfortunately, and your best bet is to read the documentation or (more authoritatively) the source code.
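As a quick illustration of how format-specific the options are (the file paths are placeholders; header, inferSchema, and multiLine are documented options of Spark's built-in CSV and JSON sources):

// CSV-specific options: header (first line holds column names), inferSchema (scan data to infer types)
scala> val csvDF = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")

// JSON-specific option: multiLine (a single record may span multiple lines)
scala> val jsonDF = spark.read.option("multiLine", "true").json("people.json")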