How to estimate GPU memory requirements for a Thrust-based implementation?


Question

I have 3 different Thrust-based implementations that perform certain calculations: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third one is in between. For each of those I know the size and data type of each device vector used, so I am using vector.size()*sizeof(type) to roughly estimate the memory needed for storage.

So for a given input, based on its size, I would like to decide which implementation to use. In other words, I want to determine the fastest implementation that will still fit in the available GPU memory.
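A minimal sketch of that selection logic, assuming three hypothetical per-implementation estimates (the byte counts below are placeholders; in practice you would sum vector.size()*sizeof(type) over every device vector each version allocates) and querying the free device memory with cudaMemGetInfo:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical storage estimates: sum vector.size() * sizeof(type)
// over every device vector each implementation needs. The factors
// here are placeholders, not measurements.
std::size_t estimate_fast(std::size_t n) { return 6 * n * sizeof(float); }
std::size_t estimate_mid (std::size_t n) { return 4 * n * sizeof(float); }
std::size_t estimate_slow(std::size_t n) { return 2 * n * sizeof(float); }

// Pick the fastest implementation whose estimated footprint fits
// into the memory that is currently free on the device.
int pick_implementation(std::size_t n)
{
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    if (estimate_fast(n) < free_bytes) return 0; // fastest, most memory
    if (estimate_mid(n)  < free_bytes) return 1; // in-between
    return 2;                                    // slowest, least memory
}
```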

I think that for the very long vectors I am dealing with, the size of the vector.data() storage that I am calculating is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.

But how would I estimate the memory usage overhead (if any) associated with the Thrust algorithm implementations? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and not a function of the sizes of the algorithm's input and output parameters, unless it is very significant.

I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for a moment.

Thank you very much for taking the time to look into this.

Answer

Thrust is intended to be used like a black box, and there is no documentation of the memory overheads of the various algorithms that I am aware of. But it doesn't sound like a very difficult problem to deduce empirically by running a few numerical experiments. You might expect the memory consumption of a particular algorithm to be approximated as:

total number of words of memory consumed = a + (1 + b)*N

for a problem with N input words. Here, a will be the fixed overhead of the algorithm, and 1 + b the slope of the best-fit line of memory consumed versus N. b is then the algorithm's overhead per input word.

So the question then becomes how to monitor the memory usage of a given algorithm. Thrust allocates internal scratch memory through an internal helper function, get_temporary_buffer. The best idea would be to write your own implementation of get_temporary_buffer that emits the size it has been called with, and (perhaps) calls cudaMemGetInfo to get context memory statistics at the time the function is invoked. You can see some concrete examples of how to intercept get_temporary_buffer calls here.
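In more recent Thrust versions the same interception can be done without touching get_temporary_buffer, by passing a custom allocator through an execution policy; this follows the pattern in Thrust's custom_temporary_allocation example. A minimal sketch (logging_allocator and its total_bytes counter are names invented here):

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/system/cuda/memory.h>
#include <cstdio>

// Allocator that forwards to the default CUDA temporary allocation
// while recording the size of every buffer Thrust asks for.
struct logging_allocator
{
    typedef char value_type;

    char *allocate(std::ptrdiff_t num_bytes)
    {
        std::printf("temporary allocation: %td bytes\n", num_bytes);
        total_bytes += num_bytes;
        return thrust::cuda::malloc<char>(num_bytes).get();
    }

    void deallocate(char *ptr, size_t)
    {
        thrust::cuda::free(thrust::cuda::pointer<char>(ptr));
    }

    std::ptrdiff_t total_bytes = 0;
};

int main()
{
    thrust::device_vector<int> d(1 << 20, 1);

    logging_allocator alloc;
    // Routing the call through the instrumented policy makes every
    // internal temporary allocation of reduce visible.
    int sum = thrust::reduce(thrust::cuda::par(alloc), d.begin(), d.end());

    std::printf("sum = %d, total temporary bytes = %td\n",
                sum, alloc.total_bytes);
    return 0;
}
```

Running this at several different problem sizes yields the (N, memory consumed) pairs needed for the fit described below.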

With a suitably instrumented allocator and some runs at a few different problem sizes, you should be able to fit the model above and estimate the a and b values for a given algorithm. The model can then be used in your code to determine the safe maximum problem size for a given amount of memory.
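A minimal sketch of that fitting step, using an ordinary least-squares line through measured (N, total words) pairs (the sample numbers below are made up for illustration):

```cpp
#include <cstdio>
#include <cstddef>

int main()
{
    // Hypothetical measurements: total words of memory consumed at
    // each input size N, as reported by the instrumented allocator.
    const std::size_t n = 4;
    double N[n]     = {1e6,   2e6,   4e6,   8e6};
    double words[n] = {1.6e6, 3.1e6, 6.1e6, 12.1e6};

    // Ordinary least-squares fit of: words = a + (1 + b) * N.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += N[i];        sy  += words[i];
        sxx += N[i] * N[i]; sxy += N[i] * words[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx); // = 1 + b
    double a     = (sy - slope * sx) / n;                     // fixed overhead

    std::printf("a = %.0f words, b = %.3f words of overhead per input word\n",
                a, slope - 1.0);
    return 0;
}
```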

I hope this is what you were asking about...
