Saving dataframe to local file system results in empty results


Problem description

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:

scala> df.count
res0: Long = 4067

The following code works fine for writing df to HDFS and reading it back:

scala> df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]

scala> hdf.count
res4: Long = 4067

However, using the same code to write to a local parquet or csv file ends up with empty results:

df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")

scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
  at scala.Option.getOrElse(Option.scala:121)

We can see why it fails:

 ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS

So there is no parquet file being written.

I have tried this maybe twenty times, for both csv and parquet, and on two different EMR servers: the same behavior is exhibited in every case.

Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works on Spark on macOS.

In case it matters - here is the versioning info:

Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3

Answer

That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because the driver and executors share one file system).
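For contrast, a minimal sketch of the same write in local mode (assuming a spark-shell started with --master "local[*]", which is an assumption, not part of the question); there the driver and the single in-process executor see the same disk, so file:// paths behave as expected:

df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
val locdf = spark.read.parquet("file:///tmp/topVendors")
locdf.count  // matches df.count here, since everything runs in one JVM on one disk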

A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
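As a rough illustration of the "collecting the data" route mentioned above, the sketch below pulls all rows to the driver and writes them with plain JVM I/O. This is only sensible for small results (the question's DataFrame has ~4000 rows); the output path and the naive CSV formatting are illustrative, not taken from the question:

import java.io.{File, PrintWriter}

val rows = df.collect()                           // brings every row to the driver
val out  = new PrintWriter(new File("/tmp/topVendors.csv"))
try {
  out.println(df.columns.mkString(","))           // header row
  rows.foreach(r => out.println(r.mkString(","))) // naive CSV, no quoting or escaping
} finally {
  out.close()
}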

Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but, depending on the commit algorithm, it might not even be finalized (moved out of the temporary directory).
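A minimal sketch of the usual workaround: write to distributed storage first (HDFS here, using the path from the question), then copy the committed output down to the driver's local disk with the Hadoop FileSystem API. The local destination path is illustrative:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

val hconf = spark.sparkContext.hadoopConfiguration
val hdfs  = FileSystem.get(hconf)
val local = FileSystem.getLocal(hconf)

// Copy the finalized HDFS output directory to the driver's local file system.
FileUtil.copy(hdfs, new Path("/tmp/topVendors"),
              local, new Path("/tmp/topVendors-local"),
              false, hconf)  // false = keep the HDFS copy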

