Saving dataframe to local file system results in empty results


Problem description

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:

scala> df.count
res0: Long = 4067

The following code works fine for writing df to HDFS and reading it back:

scala> df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]

scala> hdf.count
res4: Long = 4067

However, using the same code to write to a local parquet or csv file ends up with empty results:

df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")

scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
  at scala.Option.getOrElse(Option.scala:121)

We can see why it fails:

 ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS

So no parquet file is being written.

I have tried this maybe twenty times, for both csv and parquet, and on two different EMR servers: the same behavior is exhibited in all cases.

Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works with Spark on macOS.

In case it matters, here is the versioning info:

Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3

Answer

That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because the driver and the executors share a file system).
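
A minimal sketch of that parenthetical, assuming a standalone local[*] session (the session setup, column name, and paths below are illustrative, not from the original post): in local mode the driver and the executors run in one JVM on one machine, so a file:// destination really is shared storage and the round trip succeeds.

import org.apache.spark.sql.SparkSession

// Everything runs in a single JVM on one machine in local[*] mode.
val localSpark = SparkSession.builder()
  .master("local[*]")
  .appName("local-write-demo")
  .getOrCreate()

val demo = localSpark.range(0, 4067).toDF("id")

// A file:// path is shared between "driver" and "executors" here, so the write
// produces real part files and the read-back sees all rows.
demo.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors_local")
localSpark.read.parquet("file:///tmp/topVendors_local").count()   // 4067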

A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
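
A quick, purely illustrative check of this (the hostname probe is my own addition, not part of the original answer) is to print which executor host holds each partition of df; that host's local disk is where a file:// output path for that partition would be resolved.

import java.net.InetAddress

// Report the executor hostname and row count for every partition of df.
df.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName, rows.size))
  }
  .collect()
  .foreach { case (idx, host, n) => println(s"partition $idx: $n rows on $host") }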

Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but depending on the commit algorithm, it might not even be finalized (i.e. moved out of the temporary directory).
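
One possible workaround, sketched here as an assumption rather than something from the original answer (and assuming the repartition(1) output is small): write to HDFS exactly as before, then copy the finished directory down to the driver's local file system with the Hadoop FileSystem API, which runs on the driver only.

import org.apache.hadoop.fs.{FileSystem, Path}

// Write to distributed storage first, as in the working HDFS example above.
df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

// Then pull the committed output down to one known machine: the driver.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(new Path("/tmp/topVendors"), new Path("file:///tmp/topVendors"))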
