Read file on remote machine in Apache Spark using ftp
Problem description
I am trying to read a file on a remote machine in Apache Spark (the Scala version) using FTP. Currently, I have followed an example from Databricks' Learning Spark repo on GitHub. Using curl, I am able to download the file, so the path I use exists.
Below is a snippet of the code I am trying to execute:
var file = sc.textFile("ftp://user:pwd@192.168.1.5/brecht-d-m/map/input.nt")
var fileDF = file.toDF()
fileDF.write.parquet("out")
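As an aside, for this snippet to compile outside of spark-shell (which imports these automatically), the SQLContext implicits are needed for .toDF() in Spark 1.6. A minimal sketch, assuming a plain SparkContext named sc:

// Spark 1.6: rdd.toDF() requires the SQLContext implicits in scope.
// spark-shell does this for you; a standalone app must do it itself.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._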
After trying to execute a count on the dataframe, I get the following stacktrace (http://pastebin.com/YEq8c2Hf):
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#1L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#4L])
+- Project
+- Scan ExistingRDD[_1#0]
...
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ftp://user:pwd@192.168.1.5/brecht-d-m/map/input.nt
I would assume that the file is unreachable, but this contradicts the fact that I am able to retrieve the file via curl:
curl ftp://user:pwd@192.168.1.5/brecht-d-m/map/input.nt
This prints out the specific file on my terminal. I do not see what I am doing wrong in the Scala code. Is there an error in the code snippet I gave above, or is that code totally wrong?
Thanks in advance, Brecht
Note:
- Specifying the full path (/home/brecht-d-m/map/input.nt) also does not work (as expected, since this does not work in curl either; the server denies changing to the given directory). Trying this in Spark gives an IOException that seek is not supported (http://pastebin.com/b9EB9ru2).
- Working locally (e.g. sc.textFile("/home/brecht-d-m/map/input.nt")) works perfectly.
- File permissions for the specific file are set to R+W for all users.
- The file size (15MB) should not be a problem, and it should be able to handle much bigger files.
- Software versions: Scala 2.11.7, Apache Spark 1.6.0, Java 1.8.0_74, Ubuntu 14.04.4
Recommended answer
I was able to find a workaround, via the code snippet below:
import org.apache.spark.SparkFiles
val dataSource = "ftp://user:pwd@192.168.1.5/brecht-d-m/map/input.nt"
sc.addFile(dataSource)                                     // fetch the file over FTP
val fileName = SparkFiles.get(dataSource.split("/").last)  // local path of the downloaded copy
val file = sc.textFile(fileName)                           // load the local copy into an RDD
With this, I am able to download a file over FTP (with the same URL as in the first code snippet). The workaround first downloads the file (via addFile). Next, I retrieve the path to where the file was downloaded. Finally, I use that path to load the file into an RDD.
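For completeness, a minimal sketch of the full round trip under the question's setup, combining the workaround with the original goal of writing the data out as Parquet (the URL, credentials, and output path are the hypothetical ones from the question):

import org.apache.spark.SparkFiles

val dataSource = "ftp://user:pwd@192.168.1.5/brecht-d-m/map/input.nt"  // hypothetical host and credentials
sc.addFile(dataSource)                                                 // download the file into the SparkFiles directory
val localPath = SparkFiles.get(dataSource.split("/").last)             // resolve where the copy landed
val lines = sc.textFile(localPath)                                     // read the local copy as an RDD[String]

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._                                          // for .toDF() in Spark 1.6
lines.toDF().write.parquet("out")                                      // single column "_1", as in the stack trace

One caveat on this design: SparkFiles.get resolves the path on the node where it is called, and while addFile distributes a copy to every node, each node's copy sits under a different directory. Reading the driver-local path back with sc.textFile may therefore only behave as expected in local mode.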