apache spark: Read large size files from a directory


Problem description

I am reading each file of a directory using wholeTextFiles. After that I am calling a function on each element of the RDD using map. The whole program uses only the first 50 lines of each file. The code is as below:

import sys
import csv
import StringIO

from pyspark import SparkContext


def processFiles(fileNameContentsPair):
  # Each element of the wholeTextFiles RDD is a (fileName, fileContents) pair.
  fileName = fileNameContentsPair[0]
  result = "\n\n" + fileName
  resultEr = "\n\n" + fileName
  input = StringIO.StringIO(fileNameContentsPair[1])
  reader = csv.reader(input, strict=True)

  try:
    i = 0
    for row in reader:
      if i == 50:
        break
      # do some processing and get result string
      i = i + 1
  except csv.Error as e:
    resultEr = resultEr + "error occurred\n\n"
    return resultEr
  return result



if __name__ == "__main__":
  inputFile = sys.argv[1]
  outputFile = sys.argv[2]
  sc = SparkContext(appName = "SomeApp")
  resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
  resultRDD.saveAsTextFile(outputFile)

The size of each file in the directory can be very large in my case, so using the wholeTextFiles API will be inefficient here. Is there any efficient way to do this? I can think of iterating over the files of the directory one by one, but that also seems inefficient. I am new to Spark. Please let me know if there is any efficient way to do this.

Answer

Okay, what I would suggest is to split your files into smaller chunks first; a few GB is too large to read as a single record and is the main cause of your delay. If your data is on HDFS, you could aim for something like 64 MB per file. Otherwise you should experiment with the file size, because the right size depends on the number of executors you have. With more, smaller chunks you get more parallelism. Likewise, you can also increase the number of partitions to tune this, since your processFiles function does not seem to be CPU intensive. The only problem with many executors is that I/O increases, but if the file size is small that shouldn't be much of a problem.
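As a rough sketch of the partition tuning suggested above, the driver could pass a minPartitions hint to wholeTextFiles and/or repartition the resulting RDD before mapping. The partition counts here are arbitrary placeholders, not recommendations; tune them to your executor count:

# Minimal sketch, assuming the same inputFile/outputFile and processFiles as above.
rdd = sc.wholeTextFiles(inputFile, minPartitions=64)  # hint for how files are grouped into partitions
rdd = rdd.repartition(64)  # spread the (fileName, contents) records across executors
resultRDD = rdd.map(processFiles)
resultRDD.saveAsTextFile(outputFile)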

By the way, there is no need for a temp directory; wholeTextFiles supports wildcards like *. Also note that if you use S3 as the filesystem, there might be a bottleneck if you have too many small files, since reading lots of small files can take a while compared with reading one large file. So this is not a trivial trade-off.
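For example, a glob pattern can be passed directly as the path; the path and pattern below are made up purely for illustration:

# Sketch only: the HDFS path and file pattern are illustrative.
pairsRDD = sc.wholeTextFiles("hdfs:///data/csv/chunk-*.csv")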

Hope this helps!
