How to create an empty DataFrame in Spark


Question

I have a set of Avro-based Hive tables that I need to read data from. Because Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly, so I use the Databricks spark-avro JAR to read the Avro files from the underlying HDFS directory.

Everything works fine except when the table is empty. I managed to get the schema from the Hive table's .avsc file using the following code, but I get the error "No Avro files found":

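import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// Read the Avro schema (.avsc) from HDFS and parse it
val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"))
val schema = new Schema.Parser().parse(schemaFile)

// Load the Avro files with the explicit schema; this fails when the directory has no Avro files
spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()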

Workaround:

I placed an empty file in that directory, and the same code then works fine.

Is there any other way to achieve the same result, such as a configuration setting?

Recommended answer

You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:

empty_df = spark.createDataFrame([], schema) # spark is the Spark Session

If you already have a schema from another DataFrame, you can simply reuse it:

schema = some_other_df.schema

If you don't, create the schema for the empty DataFrame manually, for example:

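from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

# The third argument of StructField is `nullable`
schema = StructType([StructField("col_1", StringType(), True),
                     StructField("col_2", DateType(), True),
                     StructField("col_3", StringType(), True),
                     StructField("col_4", IntegerType(), False)])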

I hope this helps.
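
Coming back to the original question, one possible way to use an empty DataFrame is as a fallback when the Avro read fails. The sketch below is an assumption-laden illustration, not a confirmed fix. The helper `read_or_empty` and its `reader` parameter are hypothetical. It assumes the failure carries the "No Avro files found" message quoted in the question. It also assumes `schema` is a Spark `StructType`, not the Avro `Schema` object parsed from the .avsc file.

```python
def read_or_empty(spark, path, schema, reader):
    """Try reader(path); fall back to an empty DataFrame if no Avro files exist.

    spark  -- an existing SparkSession
    schema -- a Spark StructType used for the fallback DataFrame
    reader -- hypothetical callable that loads a DataFrame from `path`
    """
    try:
        return reader(path)
    except Exception as exc:
        # Assumption: the error text contains the message quoted in the
        # question. Any other error is re-raised unchanged.
        if "No Avro files found" not in str(exc):
            raise
        return spark.createDataFrame([], schema)
```

A call might look like `read_or_empty(spark, "/tmp/myoutput.avro", schema, lambda p: spark.read.format("com.databricks.spark.avro").load(p))`. This only catches errors raised while `reader` runs. If the error surfaces later, for example at `show()`, the check would have to move to that point.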
