How to pass files to the master node?


Problem description


I've already written code in Python to implement binary classification, and I want to parallelize this classification process across different data files on my local machine using Apache Spark. I have already done the following steps:

  1. I've written the whole project, which contains 4 Python files: "run_classifer.py" (used for running my classification application), "classifer.py" (used for binary classification), "load_params.py" (used for loading the learned parameters for classification), and "preprocessing.py" (used for pre-processing the data). The project also uses the dependency files "tokenizer.perl" (used in the preprocessing part) and "nonbreaking_prefixes/nonbreaking_prefix.en" (also used in the preprocessing part).

  2. The main part of my script file "run_classifer.py" is defined as follows:

    ### Initialize Spark
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf = conf,
        pyFiles=['''All python files in my project as
                 well as "nonbreaking_prefix.en" and "tokenizer.perl"'''])

    ### Read the data directory from S3 storage, and create an RDD
    datafile = sc.wholeTextFiles("s3n://bucket/data_dir")

    ### Send the application to each of the slave nodes
    datafile.foreach(lambda (path, content): classifier(path, content))

However, when I run my script "run_classifer.py", it seems that it cannot find the file "nonbreaking_prefix.en". The following is the error I got:

ERROR: No abbreviations files found in /tmp/spark-f035270e-e267-4d71-9bf1-8c42ca2097ee/userFiles-88093e1a-6096-4592-8a71-be5548a4f8ae/nonbreaking_prefixes

But I actually passed the file "nonbreaking_prefix.en" to the master node, and I have no idea what causes this error. I would really appreciate it if anyone could help me fix the problem.

Solution

You can upload your files using sc.addFile and get their path on a worker using SparkFiles.get (pyFiles is intended for Python code dependencies that end up on the executors' Python path, not for arbitrary data files):

from pyspark import SparkConf, SparkContext, SparkFiles

conf = SparkConf().setAppName("ruofan").setMaster("local")

sc = SparkContext(conf=conf,
    pyFiles=["All", "Python", "Files", "in", "your", "project"])

# Assuming both files are in your working directory
sc.addFile("nonbreaking_prefix.en")
sc.addFile("tokenizer.perl")

def classifier(path, content):
    # Get the path of the uploaded files on this worker
    print SparkFiles.get("tokenizer.perl")

    with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
        lines = [line for line in fr]

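For completeness, below is a minimal end-to-end sketch of how the question's driver code and the answer's fix fit together. It reuses the names from the post (the local master, the project's Python files, and the "s3n://bucket/data_dir" placeholder path); the body of classify_record is a hypothetical stand-in for the real preprocessing and classification logic, not the author's actual implementation:

from pyspark import SparkConf, SparkContext, SparkFiles

def classify_record(record):
    # wholeTextFiles yields (path, content) pairs; unpack explicitly
    # instead of using a Python 2-only tuple-unpacking lambda
    path, content = record
    # Resolve the files shipped via sc.addFile; SparkFiles.get returns
    # their absolute location on whichever node runs this task
    tokenizer_path = SparkFiles.get("tokenizer.perl")
    with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
        prefixes = [line.strip() for line in fr]
    # ... run the actual preprocessing and classification here ...

if __name__ == "__main__":
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    # pyFiles ships only the code dependencies
    sc = SparkContext(conf=conf,
        pyFiles=["classifer.py", "load_params.py", "preprocessing.py"])

    # Data files are distributed separately with addFile,
    # assuming they sit in the driver's working directory
    sc.addFile("nonbreaking_prefix.en")
    sc.addFile("tokenizer.perl")

    datafile = sc.wholeTextFiles("s3n://bucket/data_dir")
    datafile.foreach(classify_record)

The key design point is the split of responsibilities: pyFiles ships code that must be importable on the executors, while addFile ships plain files that are later located by name with SparkFiles.get, so the classifier never depends on the driver's working directory.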