How to rename a file when providing to Spark via --files

Problem description

Referencing here and here, I expect that I should be able to change the name by which a file is referenced in Spark by using an octothorpe - that is, if I call spark-submit --files local-file-name.json#spark-file-name.json, I should then be able to reference the file as spark-file-name.json. However, this doesn't appear to be the case:

$ cat ../differentDirectory/local-file-name.json
{
  "name": "Adam",
  "age": 25
}

$ cat testing1.py
import os
import json
import time
from pyspark import SparkFiles, SparkContext

print(os.getcwd())
print(os.listdir('.'))
sc = SparkContext('local', 'App For Testing --files upload')
print(SparkFiles.getRootDirectory())
print(os.listdir(SparkFiles.getRootDirectory()))
print(json.load(open(SparkFiles.get('local-file-name.json'))))

$ spark-submit --files ../differentDirectory/local-file-name.json testing1.py
20/08/06 17:05:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/private/tmp/sparkSubmitTesting
['testing.py']
...
/private/var/folders/0q/qw3xxl5x2yx1rf1nncl6s4rw2yzhgq/T/spark-2d052f27-59da-463a-9ddf-edd05108c19a/userFiles-5fec4b39-90e3-4402-a644-0c5314c1d0a5
[u'local-file-name.json']
{u'age': 25, u'name': u'Adam'}
...

$ cat testing2.py
import os
import json
import time
from pyspark import SparkFiles, SparkContext

print(os.getcwd())
print(os.listdir('.'))
sc = SparkContext('local', 'App For Testing --files upload')
print(SparkFiles.getRootDirectory())
print(os.listdir(SparkFiles.getRootDirectory()))
print(json.load(open(SparkFiles.get('spark-file-name.json'))))

$ spark-submit --files ../differentDirectory/local-file-name.json#spark-file-name.json testing2.py
20/08/06 17:07:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/private/tmp/sparkSubmitTesting
['testing.py']
...
20/08/06 17:07:38 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/private/tmp/differentDirectory/local-file-name.json#spark-file-name.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1544)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1508)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:462)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:462)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:462)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I've tried backslash-escaping the # (that is, --files ../differentDirectory/local-file-name.json\#spark-file-name.json), quote-wrapping the file path, and explicitly prepending file://, but in all cases I get either the same error (File <path, including fragment> does not exist) or Expected scheme-specific part at index 5.

MacOS, Spark v2.4.5

Answer

A coworker pointed out that this renaming behaviour relies on YARN, which is absent when running locally - so this feature is not expected to work in my setup.
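
On a YARN cluster the fragment syntax should work as originally attempted. If the file merely needs a different name inside a local-mode job, one workaround is to submit it without the fragment and copy it to the desired name after Spark has distributed it. The script below is a minimal sketch along those lines; testing3.py and the copy step are my own additions, not part of the original answer:

$ cat testing3.py
import json
import os
import shutil
from pyspark import SparkFiles, SparkContext

sc = SparkContext('local', 'App For Testing --files upload')
# Spark exposes the file under the name it was submitted with...
src = SparkFiles.get('local-file-name.json')
# ...so copy it to the name the rest of the code expects.
dst = os.path.join(SparkFiles.getRootDirectory(), 'spark-file-name.json')
shutil.copy(src, dst)
print(json.load(open(SparkFiles.get('spark-file-name.json'))))

$ spark-submit --files ../differentDirectory/local-file-name.json testing3.py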
