How to pass files to the master node?
Problem description
I've written Python code that implements binary classification, and I want to use Apache Spark to parallelize the classification across different data files on my local computer. So far I have done the following:
I've written the whole project, which contains 4 Python files: "run_classifer.py" (runs my classification application), "classifer.py" (performs the binary classification), "load_params.py" (loads the learned parameters for classification) and "preprocessing.py" (pre-processes the data). The project also uses two dependency files: "tokenizer.perl" (used in the preprocessing part) and "nonbreaking_prefixes/nonbreaking_prefix.en" (also used in the preprocessing part).
The main part of my script file "run_classifer.py" is defined as follows:
from pyspark import SparkConf, SparkContext

### Initialize Spark
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf=conf,
                  pyFiles=['''All python files in my project as
                  well as "nonbreaking_prefix.en" and "tokenizer.perl"'''])

### Read the data directory from S3 storage and create an RDD
datafile = sc.wholeTextFiles("s3n://bucket/data_dir")

### Send the application to each of the slave nodes
### (each element is a (path, content) pair)
datafile.foreach(lambda pair: classifier(pair[0], pair[1]))
However, when I run my script "run_classifier.py", it seems that the file "nonbreaking_prefix.en" cannot be found. This is the error I get:
ERROR: No abbreviations files found in /tmp/spark-f035270e-e267-4d71-9bf1-8c42ca2097ee/userFiles-88093e1a-6096-4592-8a71-be5548a4f8ae/nonbreaking_prefixes
But I did pass the file "nonbreaking_prefix.en" to the master node, and I have no idea what causes the error. I would really appreciate it if anyone could help me fix the problem.
Recommended answer
You can upload your files using sc.addFile and get their path on a worker using SparkFiles.get:
from pyspark import SparkContext, SparkFiles

# conf is the SparkConf object created in the question
sc = SparkContext(conf=conf,
                  pyFiles=["All", "Python", "Files", "in", "your", "project"])

# Assuming both files are in your working directory
sc.addFile("nonbreaking_prefix.en")
sc.addFile("tokenizer.perl")

def classifier(path, content):
    # Get the local paths of the uploaded files on the worker
    print(SparkFiles.get("tokenizer.perl"))
    with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
        lines = [line for line in fr]
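One thing to watch for: the path in the error message suggests that tokenizer.perl looks for a "nonbreaking_prefixes" directory next to itself. If the uploaded files end up side by side in a single directory, that subdirectory would be missing. The following is only a minimal sketch under that assumption; the helper name ensure_prefix_dir and its dirname parameter are hypothetical and not part of Spark or of the original project. It copies the prefix file into a "nonbreaking_prefixes" folder beside the tokenizer script, using only the Python standard library.

```python
import os
import shutil


def ensure_prefix_dir(tokenizer_path, prefix_path, dirname="nonbreaking_prefixes"):
    """Copy prefix_path into <directory of tokenizer_path>/<dirname>/.

    Hypothetical helper: assumes the tokenizer script searches a
    subdirectory called `dirname` located next to the script itself.
    Returns the path of the copied prefix file.
    """
    base_dir = os.path.dirname(os.path.abspath(tokenizer_path))
    target_dir = os.path.join(base_dir, dirname)
    # exist_ok avoids an error if another task already created the directory
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(prefix_path))
    if not os.path.exists(target):
        shutil.copy(prefix_path, target)
    return target
```

Inside classifier, this could be called as ensure_prefix_dir(SparkFiles.get("tokenizer.perl"), SparkFiles.get("nonbreaking_prefix.en")) before the Perl script is invoked. Whether this resolves the error depends on how tokenizer.perl actually locates its prefix files, so treat it as a starting point.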