How to pass files to the master node?

Question
I've already written code in Python to implement binary classification, and I want to parallelize this classification process over different data files on my local computer using Apache Spark. I have already done the following steps:
I've written the whole project containing 4 Python files: "run_classifer.py" (used for running my classification application), "classifer.py" (used for binary classification), "load_params.py" (used for loading the learning parameters for classification) and "preprocessing.py" (used for pre-processing data). The project also uses the dependency files "tokenizer.perl" (used in the preprocessing part) and "nonbreaking_prefixes/nonbreaking_prefix.en" (also used in the preprocessing part).
The main part of my script file "run_classifer.py" is defined as follows:
### Initialize Spark
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf=conf,
                  pyFiles=['''All python files in my project as well as
                          "nonbreaking_prefix.en" and "tokenizer.perl"'''])

### Read the data directory from S3 storage, and create an RDD
datafile = sc.wholeTextFiles("s3n://bucket/data_dir")

### Send the application to each of the slave nodes
datafile.foreach(lambda (path, content): classifier(path, content))
However, when I run my script "run_classifier.py", it seems that it cannot find the file "nonbreaking_prefix.en". The following is the error I got:
ERROR: No abbreviations files found in /tmp/spark-f035270e-e267-4d71-9bf1-8c42ca2097ee/userFiles-88093e1a-6096-4592-8a71-be5548a4f8ae/nonbreaking_prefixes
But I actually passed the file "nonbreaking_prefix.en" to the master node, and I have no idea what causes the error. I would really appreciate it if anyone could help me fix this problem.
Solution

You can upload your files using sc.addFile and get the path on a worker using SparkFiles.get:
from pyspark import SparkFiles

sc = SparkContext(conf=conf,
                  pyFiles=["All", "Python", "Files", "in", "your", "project"])

# Assuming both files are in your working directory
sc.addFile("nonbreaking_prefix.en")
sc.addFile("tokenizer.perl")

def classifier(path, content):
    # Get the path of the uploaded files
    print SparkFiles.get("tokenizer.perl")
    with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
        lines = [line for line in fr]
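To see why resolving files by basename works, here is a minimal stand-in for the addFile/SparkFiles.get pattern built only from the Python standard library. The staging directory and the two helper functions are illustrative inventions, not the real pyspark API; they only mimic its behavior of staging a file on the driver and resolving it again by name on a worker:

```python
import os
import shutil
import tempfile

# Simplified stand-in for Spark's file distribution (illustration only,
# not the real pyspark API): the driver stages a dependency file into a
# per-job temp directory, and workers later resolve it by basename alone,
# which is why the answer opens files through SparkFiles.get().

_staging_dir = tempfile.mkdtemp(prefix="userFiles-")

def add_file(path):
    """Stage a local file in the job's temp directory (like sc.addFile)."""
    shutil.copy(path, os.path.join(_staging_dir, os.path.basename(path)))

def get_file(name):
    """Resolve a staged file by basename (like SparkFiles.get)."""
    return os.path.join(_staging_dir, name)

# "Driver" side: create a stand-in abbreviations file and stage it.
src = os.path.join(tempfile.mkdtemp(), "nonbreaking_prefix.en")
with open(src, "w") as fw:
    fw.write("Mr\nDr\n")
add_file(src)

# "Worker" side: open the file by basename, not by its driver-side path.
with open(get_file("nonbreaking_prefix.en")) as fr:
    lines = [line.strip() for line in fr]
print(lines)  # → ['Mr', 'Dr']
```

The key point this illustrates is that the worker never sees the driver's original path, so any dependency (including non-Python files like "tokenizer.perl") has to be looked up through SparkFiles.get rather than a hard-coded location.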