When do binaryFiles load into memory when mapPartitions is used?


Question

I am using PySpark to apply a trained deep learning model to images and am concerned with how memory usage will scale with my current approach. Because the trained model takes a while to load, I process large batches of images on each worker with code similar to the following:

def run_eval(file_generator):
    trained_model = load_model()
    results = []
    for file in file_generator:
        # "file" is a tuple: [0] is its filename, [1] is the byte data
        results.append(trained_model.eval(file[1]))
    return results

my_rdd = sc.binaryFiles('adl://my_file_path/*.png').repartition(num_workers)
results = my_rdd.mapPartitions(run_eval)
results.collect()

As noted above, the files are stored on an associated HDFS file system (specifically, an Azure Data Lake Store) which can be accessed through the SparkContext.

My main questions are:

  • When is the image data being loaded into memory?
    • Is each image's data loaded when the generator increments ("just in time")?
    • Is all image data for the whole partition loaded before the worker starts?
  • Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?

I'd also appreciate advice on where to find these topics covered in depth.

Answer

      When is the image data being loaded into memory?

      • Is each image's data loaded when the generator increments ("just in time")?

Actually, given your code, it has to be loaded more than once. First it is accessed by the JVM and converted to Python types; after that the shuffle occurs and the data is loaded once again. Each process is lazy, so loading itself is not an issue.

So the first question you have to ask yourself is whether you really have to shuffle. binaryFiles has a minPartitions argument which can be used to control the initial number of partitions.
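
As a minimal sketch (reusing num_workers and the path from the question; note that minPartitions is only a hint for the minimum number of partitions, not an exact count), the repartition shuffle can be avoided by asking binaryFiles for the partitioning up front:

# Ask binaryFiles to create (at least) num_workers partitions directly,
# instead of repartitioning (and shuffling the raw bytes) afterwards.
my_rdd = sc.binaryFiles('adl://my_file_path/*.png',
                        minPartitions=num_workers)
results = my_rdd.mapPartitions(run_eval)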

Another problem is the non-lazy results list. It would make much more sense to use a generator function:

def run_eval(file_generator):
    trained_model = load_model()
    for file in file_generator:
        # Yield results one at a time instead of materializing
        # the whole partition's output in a list.
        yield trained_model.eval(file[1])
      
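On the same theme (a sketch that goes beyond the original answer; handle is a hypothetical consumer function), collect() would still materialize every result on the driver at once. RDD.toLocalIterator() streams them back partition by partition instead:

results = my_rdd.mapPartitions(run_eval)

# toLocalIterator() fetches one partition at a time, so the driver
# never holds the full result set in memory simultaneously.
for prediction in results.toLocalIterator():
    handle(prediction)  # hypothetical handler for a single result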

      Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?

There is no central processing involved. Each executor process (Python) / thread (JVM) will load its own part of the dataset.
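
If in doubt, this can be checked directly. As a small diagnostic sketch (assuming my_rdd from the question; note that iterating the partition also reads the file bytes, so run it on a small sample), mapPartitionsWithIndex shows which filenames ended up in which partition, and therefore which executor reads them:

# Each partition reports its index and the filenames it contains;
# binaryFiles yields (filename, bytes) pairs within each partition.
def list_files(index, iterator):
    yield (index, [name for name, _ in iterator])

print(my_rdd.mapPartitionsWithIndex(list_files).collect())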
