Spark to process many tar.gz files from S3

Views: 203 | Published: 2015/12/1 10:54:43 | Tags: amazon-s3, apache-spark


Problem description

I have many files named in the format log-.tar.gz in S3. I would like to process them (extract a field from each line) and store the result in a new file.

There are many ways to do this. One simple and convenient approach is to read the files with the textFile method.

# Read the matching files from S3
rdd = sc.textFile("s3://bucket/project_name/date_folder/logfile1.*.gz")
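
For illustration, the extract-and-store step described above could be sketched roughly as follows. This assumes PySpark and an already-created SparkContext `sc`; the helper names `extract_field` and `extract_and_store`, the whitespace delimiter, and the field index are hypothetical choices, not something stated in the question.

```python
def extract_field(line, index=0, sep=None):
    """Return the field at position `index`, or "" if the line has too few fields.

    With sep=None the line is split on runs of whitespace (an assumption
    about the log format).
    """
    parts = line.split(sep)
    return parts[index] if index < len(parts) else ""


def extract_and_store(sc, input_pattern, output_path, index=0):
    """Read lines matching `input_pattern`, keep one field per line, and save them.

    `sc` is an existing SparkContext; the paths are supplied by the caller.
    """
    lines = sc.textFile(input_pattern)
    fields = lines.map(lambda line: extract_field(line, index))
    fields.saveAsTextFile(output_path)
```

Note that saveAsTextFile writes a directory of part files rather than one single file, so a final merge step may be needed if a single output file is required.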

I am concerned about the memory limit of the cluster. With this approach, the master node will be overloaded. Is there a rough estimate of the file sizes that a given type of cluster can process?

I am wondering if there is a way to parallelize fetching the *.gz files from S3, since they are already grouped by date.
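
One possible direction, sketched here rather than offered as a definitive answer: textFile would typically undo only the gzip layer, so lines read from a .tar.gz archive may still contain tar header bytes. Reading each archive as a single binary record with `sc.binaryFiles` and unpacking it with Python's `tarfile` module lets different archives be handled by different tasks. This assumes PySpark, an existing SparkContext `sc`, UTF-8 text logs, and that each archive fits in the memory of one executor; the helper names `lines_from_targz` and `read_archives` are hypothetical.

```python
import io
import tarfile


def lines_from_targz(payload):
    """Yield text lines from every regular file inside a tar.gz byte payload."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            handle = archive.extractfile(member)
            for raw in handle:
                # Assumes UTF-8 logs; undecodable bytes are replaced, not raised
                yield raw.decode("utf-8", errors="replace").rstrip("\r\n")


def read_archives(sc, pattern):
    """Return an RDD of lines taken from all archives matching `pattern`.

    binaryFiles yields (path, bytes) pairs, one pair per archive.
    """
    return sc.binaryFiles(pattern).flatMap(lambda kv: lines_from_targz(kv[1]))
```

A call might look like `read_archives(sc, "s3://bucket/project_name/date_folder/logfile1.*.gz")`, reusing the pattern from the question. Because every archive is loaded whole, very large archives would need a different approach.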
