Load Data using Apache-Spark on AWS


Question

I am using Apache-Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node, I have a directory named data containing all the data files, in CSV format, that need to be processed.

Now, before we submit the driver program (my Python code) to run, we need to copy the directory data from the master to all slave nodes. My understanding is that this is because each slave node needs to know the location of the data files in its own local file system so that it can load them. For example,

from pyspark import SparkConf, SparkContext

### Initialize the SparkContext
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf = conf)

### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data")  ### Read data directory 

### Collect files from the RDD
datafile.collect() 

When each slave node runs a task, it loads the data files from its own local file system.

However, before we submit my application to run, we also have to put the directory data into the Hadoop Distributed File System (HDFS) using $ ./ephemeral-hdfs/bin/hadoop fs -put /root/data/ ~.
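For reference, this is roughly what reading the same directory from ephemeral HDFS rather than the local file system would look like in PySpark. This is only a sketch: the NameNode URI (hdfs://<master-hostname>:9000) and the /user/root/data path are placeholders that depend on how the spark-ec2 cluster was configured.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("hdfs-read-sketch")
sc = SparkContext(conf=conf)

### Read from HDFS; the NameNode host/port below is a placeholder and
### must match the ephemeral-hdfs configuration on the cluster.
hdfs_files = sc.wholeTextFiles("hdfs://<master-hostname>:9000/user/root/data")

### Read from the node-local file system instead; this requires /root/data
### to exist on every worker node.
local_files = sc.wholeTextFiles("file:///root/data")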

Now I am confused about this process. Does each slave node load the data files from its own local file system or from HDFS? If it loads data from the local file system, why do we need to put the data into HDFS at all? I would appreciate it if anyone could help me.

Answer

One quick suggestion is to load the CSV files from S3 instead of keeping them on the local file system.

Here is a sample Scala snippet which can be used to load a bucket from S3:

val csvs3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"
val dataframe = sqlContext.
                    read.
                    format("com.databricks.spark.csv").
                    option("header", "true").
                    load(csvs3Path)
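Since the question uses PySpark, a rough Python equivalent of the above would be the sketch below. It assumes the spark-csv package (com.databricks:spark-csv) has been added to the cluster, e.g. via spark-submit --packages, and the S3 path and credentials are placeholders to replace with your own.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("s3-csv-read-sketch")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

### Placeholder s3n URI; substitute your own access key, secret key and bucket.
csv_s3_path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"

### Load the CSV files in the bucket into a DataFrame using the spark-csv package.
dataframe = (sqlContext.read
             .format("com.databricks.spark.csv")
             .option("header", "true")
             .load(csv_s3_path))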
