How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Question

Say I have access to a number of GPUs in a single machine (for the sake of argument assume 8 GPUs, each with a maximum of 8GB of memory, in one single machine with some amount of RAM and disk). I want to run, in one single script and on one single machine, a program that evaluates multiple models (say 50 or 200) in TensorFlow, each with a different hyperparameter setting (say, step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record the model's accuracy and get rid of the model (if you want, assume the model is being checkpointed every so often, so it is fine to just throw away the model and start training from scratch). You may also assume some other data may be recorded, like the specific hyperparameters and the train/validation errors logged as we train, etc.

Currently I have a (pseudo-)script that looks as follows:

import tensorflow as tf

def train_multiple_models_in_one_script_with_gpu(arg):
    '''
    Trains multiple NN models in one session using GPUs correctly.

    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create graph
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg, x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg, x, y, y_)
            ### get optimizer variables
            global_step = tf.Variable(0, trainable=False, name='global_step')
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
        #### run session
        with tf.Session(graph=graph) as sess:
            sess.run(tf.global_variables_initializer())
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # check_point mdl
                if i % report_error_freq == 0:
                    sess.run(global_step.assign(i))
                    # report training and test errors
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print('step %d, train error: %s test_error %s' % (i, train_error, test_error))

Essentially, it tries lots of models in one single run, but it builds each model in a separate graph and runs each one in a separate session.

I guess my main worry is that it's unclear to me how TensorFlow, under the hood, allocates resources for the GPUs to be used. For example, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it brought into the GPU immediately, or when is it inserted into the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much whether the models run in parallel on multiple GPUs (which would be a nice addition), but I want it to first run everything serially without crashing. Is there anything special I need to do for this to work?

Currently I am getting an error that starts as follows:

I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                   340000768
InUse:                   336114944
MaxInUse:                339954944
NumAllocs:                      78
MaxAllocSize:            335665152

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]

and further on:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
         [[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)

However, further down the output file (where it prints) it seems to print fine the errors/messages that should show as training proceeds. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the GPU, then why is this error only happening when the GPU is about to be used?

The weird thing is that the data set is really not that big (all 60K points amount to 24.5M), and when I run a single model locally on my own computer, the process seems to use less than 5GB. The GPUs have at least 8GB each, and the computer with them has plenty of RAM and disk (at least 16GB). Thus, the errors that TensorFlow is throwing at me are quite puzzling. What is it trying to do, and why are they occurring? Any ideas?

After reading the answer that suggests using the multiprocessing library, I came up with the following script:

from multiprocessing import Process

def train_mdl(args):
    train(mdl, args)

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen
        # randomly inside the function below, or read from a config file, or they
        # could just be passed in or something)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()
    print('Done training all models!')

Honestly, I am not sure why his answer suggests using a pool, or why there are weird tuple brackets, but this is what would make sense to me. Would the resources for TensorFlow be re-allocated every time a new process is created in the above loop?

Answer

I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: you can encapsulate your TF session into a process with the multiprocessing module; this will make sure TF releases the session memory once the process is done. Here is a code snippet:

from multiprocessing import Pool
import contextlib

def my_model((param1, param2, param3)): # Note the extra (), required by the pool syntax (Python 2 tuple unpacking)
    < your code >

num_pool_workers = 1  # can be bigger than 1, to enable parallel execution
with contextlib.closing(Pool(num_pool_workers)) as po:  # This ensures that the processes get closed once they are done
    pool_results = po.map_async(my_model,
                                ((param1, param2, param3)
                                 for param1, param2, param3 in params_list))
    results_list = pool_results.get()

Note from OP: The random number generator seed does not reset automatically with the multi-processing library if you choose to use it. Details here: Using python multiprocessing with different random seed for each process
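
For example, a minimal sketch of reseeding each worker (deriving the seed from the process ID is just one arbitrary choice) could be:

import os
import random
import numpy as np
import tensorflow as tf

def my_model(params):
    (param1, param2, param3) = params   # unpack inside, as in the more readable variant below
    seed = os.getpid()                  # any distinct per-process value works; the PID is just convenient
    random.seed(seed)                   # Python's built-in RNG
    np.random.seed(seed)                # NumPy's RNG (forked workers otherwise inherit the parent's state)
    tf.set_random_seed(seed)            # TF graph-level seed
    < your code >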

About TF resource allocation: Usually TF allocates much more resources than it needs. Many times you can restrict each process to use a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.

You can use the following code snippet:

gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)

Note that sometimes TF increases the memory usage in order to accelerate the execution. Therefore, reducing the memory usage might make your model run slower.
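
If you would rather not pick a fixed fraction by hand, TF can also grow its GPU allocation on demand; a minimal sketch (note that allow_growth makes memory usage less predictable when many processes share a GPU, so the fixed fraction above is often safer):

gpu_options = tf.GPUOptions(allow_growth=True)  # claim GPU memory incrementally instead of all at once
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)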

Answering the new questions in your edit/comments:

  1. Yes, TensorFlow's resources will be re-allocated every time a new process is created, and cleared once a process ends.

The for-loop in your edit should also do the job. I suggest using Pool instead, because it will enable you to run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) the Pool map runs the loop for you, so you don't need an outer for-loop once you use it; (2) in your example, you should have something like mdl=get_model(args) before calling train(). A rough sketch of both points is below.
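
For concreteness, a rough sketch of that combination (get_model and train are placeholders for your own code, and params_list holds one hyper-parameter configuration per model, as above):

from multiprocessing import Pool
import contextlib

def train_mdl(args):
    mdl = get_model(args)    # placeholder: build the graph/model for this hyper-parameter configuration
    train(mdl, args)         # placeholder: run the training session; GPU memory is freed when the worker exits

if __name__ == '__main__':
    num_pool_workers = 2     # how many models to train concurrently (see the gpu_memory_fraction note)
    with contextlib.closing(Pool(num_pool_workers)) as po:
        pool_results = po.map_async(train_mdl, params_list)
        pool_results.get()   # wait for all models to finish
    print('Done training all models!')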

Weird tuple parentheses: Pool only passes a single argument to the mapped function, therefore we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer, you can make it more readable with:

def train_mdl(params):
    (x,y)=params
    < your code >

  • As @Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU to use for your process. You can do it from within your python script using the following at the beginning of the process function (train_mdl):

    import os # the import can be on the top of the python script
    os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
    

  • A better practice for executing your experiments would be to isolate your training/evaluation code from the hyperparameter / model search code. E.g. have a script named train.py, which accepts a specific combination of hyperparameters and references to your data as arguments, and executes training for a single model (a minimal skeleton is sketched just below).
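
    For example, a minimal train.py skeleton (the flag names here are purely illustrative) might look like:

    import argparse

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--lr', type=float, default=0.01)          # illustrative hyper-parameter flags
        parser.add_argument('--batch_size', type=int, default=64)
        parser.add_argument('--data_dir', type=str, default='./data')
        args = parser.parse_args()
        # build the graph for this single configuration, train it,
        # then log the final accuracy / hyper-params and exit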

    Then, to iterate through all the possible combinations of parameters you can use a simple task (jobs) queue, and submit all the possible combinations of hyper-parameters as separate jobs. The task queue will feed your jobs one at a time to your machine. Usually, you can also set the queue to execute a number of processes concurrently (see details below).

    Specifically, I use task spooler, which is super easy to install and handy (it doesn't require admin privileges, details below).

    Basic usage is (see notes below about task spooler usage):

    ts <your-command>
    

    In practice, I have a separate python script that manages my experiments, sets all the arguments per specific experiment, and sends the jobs to the ts queue.

    Here are some relevant python snippets from my experiments manager:

    run_bash executes a bash command:

    import subprocess

    def run_bash(cmd):
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
        out = p.stdout.read().strip()
        return out  # This is the stdout from the shell command
    

    The next snippet sets the number of concurrent processes to be run (see note below about choosing the maximal number of processes):

    max_job_num_per_gpu = 2
    run_bash('ts -S %d'%max_job_num_per_gpu)
    

    The next snippet iterates through a list of all combinations of hyper-params / model params. Each element of the list is a dictionary, where the keys are the command-line arguments for the train.py script:

    for combination_dict in combinations_list:
    
        job_cmd = 'python train.py ' + '  '.join(
                ['--{}={}'.format(flag, value) for flag, value in combination_dict.iteritems()])
    
        submit_cmd = "ts bash -c '%s'" % job_cmd
        run_bash(submit_cmd)
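
    For completeness, combinations_list itself can be generated with a plain Cartesian product over the hyper-parameter grid (a minimal sketch; the parameter names are illustrative):

    import itertools

    hyperparam_grid = {'lr': [0.1, 0.01, 0.001],   # illustrative values
                       'batch_size': [32, 64]}
    keys = sorted(hyperparam_grid.keys())
    combinations_list = [dict(zip(keys, values))
                         for values in itertools.product(*[hyperparam_grid[k] for k in keys])]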
    

    A note about choosing the maximal number of processes:

    If you are short on GPUs, you can use the gpu_memory_fraction you found to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction).
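
    For example (a small worked instance, reusing the snippets above):

    gpu_memory_fraction = 0.3                           # the fraction found by trial and error
    max_job_num_per_gpu = int(1 / gpu_memory_fraction)  # int(1/0.3) -> 3 concurrent jobs per GPU
    run_bash('ts -S %d' % max_job_num_per_gpu)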

    Notes about task spooler (ts):

    1. You could set the number of concurrent processes to run ("slots") with:

    ts -S <number-of-slots>

    Installing ts doesn't require admin privileges. You can download and compile it from source with a simple make, add it to your path, and you're done.

    You can set up multiple queues (I use it for multiple GPUs), with

    TS_SOCKET=<path_to_queue_name> ts <your-command>

    e.g.

    TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>

    TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>

    See here for further usage examples.

    A note about automatically setting the path names and file names: Once you separate your main code from the experiment manager, you will need an efficient way to generate file names and directory names, given the hyper-params. I usually keep my important hyper-params in a dictionary and use the following functions to generate a single chained string from the dictionary key-value pairs. Here are the functions I use for doing it:

    import re

    def build_string_from_dict(d, sep='%'):
        """
        Builds a string from a dictionary.
        Mainly used for formatting hyper-params to file names.
        Key-value pairs are sorted by the key name.

        :param d: input dictionary
        :param sep: key-value separator
        :return: string
        """
    
        return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])
    
    
    def _value2str(val):
        if isinstance(val, float): 
            # %g means: "Floating point format.
            # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
            # decimal format otherwise."
            val = '%g' % val
        else:
            val = '{}'.format(val)
        val = re.sub(r'\.', '_', val)  # replace '.' with '_' so the value is safe in file names
        return val
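
    A quick usage example (the expected output is shown in the comment, assuming the functions above):

    hyper_params = {'lr': 0.001, 'batch_size': 32}
    print(build_string_from_dict(hyper_params))
    # -> batch_size=32%lr=0_001   (keys sorted, dots replaced by underscores)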
    
