cuDF error processing a large number of parquet files
Problem description
I have 2000 parquet files in a directory. Each parquet file is roughly 20MB in size. The compression used is SNAPPY. Each parquet file has rows that look like the following:
+------------+-----------+-----------------+
| customerId | productId | randomAttribute |
+------------+-----------+-----------------+
| ID1 | PRODUCT1 | ATTRIBUTE1 |
| ID2 | PRODUCT2 | ATTRIBUTE2 |
| ID2 | PRODUCT3 | ATTRIBUTE3 |
+------------+-----------+-----------------+
Each column entry is a string. I am using a p3.8xlarge EC2 instance with the following configuration:
- RAM: 244GB
- vCPU: 32
- GPU RAM: 64GB (16GB per GPU)
- GPUs: 4 Tesla V100
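I am trying the following code:
import cudf

def read_all_views(parquet_file_lst):
    # Read only the two needed columns from each file, then concatenate on one GPU.
    df_lst = []
    for file in parquet_file_lst:
        df = cudf.read_parquet(file, columns=['customerId', 'productId'])
        df_lst.append(df)
    return cudf.concat(df_lst)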
This crashes after processing the first 180 files with the following runtime error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in read_all_views
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/cudf/io/parquet.py", line 54, in read_parquet
    use_pandas_metadata,
  File "cudf/_lib/parquet.pyx", line 25, in cudf._lib.parquet.read_parquet
  File "cudf/_lib/parquet.pyx", line 80, in cudf._lib.parquet.read_parquet
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure
Only about 10% of both GPU and CPU RAM is in use at any given time. Any ideas on how to debug this, or what workarounds are available?
Recommended answer
cuDF is a single-GPU library. 2000 files of 20 MB come to about 40 GB of data, which is more than fits in the 16 GB of memory on a single V100 GPU.
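The arithmetic behind that estimate can be checked in a few lines. This is only a rough sketch: the helper name fits_on_one_gpu is made up for illustration, the figures are the on-disk (SNAPPY-compressed) sizes from the question, and the in-memory footprint after decompression may well differ, especially since only two of the three columns are read.

```python
def fits_on_one_gpu(n_files, file_mb, gpu_gb):
    """Return (total size in GB, whether it fits in gpu_gb).

    Hypothetical helper: uses on-disk sizes only, so it is a rough
    estimate rather than a measurement of GPU memory use.
    """
    total_gb = n_files * file_mb / 1000
    return total_gb, total_gb <= gpu_gb

# Figures from the question: 2000 files, ~20 MB each, 16 GB per V100.
total_gb, fits = fits_on_one_gpu(2000, 20, 16)
print(total_gb, fits)  # 40.0 False
```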
For workflows that require more than a single GPU, cuDF relies on Dask. The following example illustrates how you could use cuDF + Dask to read data into distributed GPU memory across multiple GPUs in a single node. This doesn't answer your debugging question, but it should hopefully solve your problem.
First, I use a few lines of code to create a Dask cluster of two GPUs.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
cluster = LocalCUDACluster() # by default use all GPUs in the node. I have two.
client = Client(cluster)
client
# The print output of client:
#
# Client
# Scheduler: tcp://127.0.0.1:44764
# Dashboard: http://127.0.0.1:8787/status
# Cluster
# Workers: 2
# Cores: 2
# Memory: 404.27 GB
Next I'll create a couple of parquet files for this example.
import os
import cudf
from cudf.datasets import randomdata

if not os.path.exists('example_output'):
    os.mkdir('example_output')

for x in range(2):
    df = randomdata(nrows=10000,
                    dtypes={'a':int, 'b':str, 'c':str, 'd':int},
                    seed=12)
    df.to_parquet('example_output/df')
Let's look at the memory on each of my GPUs with nvidia-smi.
nvidia-smi
Thu Sep 26 19:13:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:AF:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 6836MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:D8:00.0 Off | 0 |
| N/A 47C P0 28W / 70W | 5750MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Notice the two values: 6836 MiB on GPU 0 and 5750 MiB on GPU 1 (I happen to have unrelated data already in memory on these GPUs). Now let's read our entire directory of two parquet files with Dask cuDF and then persist it. Persisting forces computation: Dask execution is lazy, so calling read_parquet on its own only adds a task to the task graph. ddf is a Dask DataFrame.
ddf = dask_cudf.read_parquet('example_output/df')
ddf = ddf.persist()
Now let's look at nvidia-smi again.
Thu Sep 26 19:13:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:AF:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 6938MiB / 15079MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:D8:00.0 Off | 0 |
| N/A 47C P0 28W / 70W | 5852MiB / 15079MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
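If you would rather compare such readings programmatically than by eye, something like the following sketch could work. It assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names parse_memory_csv and gpu_memory_mib are made up for illustration. The sample strings are simply the memory figures copied from the two nvidia-smi tables above.

```python
import subprocess

def parse_memory_csv(text):
    """Parse lines like '6836, 15079' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory_mib():
    """Query per-GPU memory via nvidia-smi (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)

# Readings copied from the two nvidia-smi tables above (before and after persist).
before = parse_memory_csv("6836, 15079\n5750, 15079")
after = parse_memory_csv("6938, 15079\n5852, 15079")
deltas = [a[0] - b[0] for b, a in zip(before, after)]
print(deltas)  # [102, 102]
```

On a machine with GPUs, calling gpu_memory_mib() before and after persist() would give the live equivalents of the two sample strings.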
Memory use rose by about 102 MiB on each GPU (6836 to 6938 MiB on GPU 0, 5750 to 5852 MiB on GPU 1): Dask handles distributing our data across both GPUs for us.
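Applied to the data in the question, the same pattern might look roughly like the sketch below. It assumes a Dask client backed by a LocalCUDACluster has already been created as shown earlier, that the 2000 files sit in one directory, and that only customerId and productId are needed. The function name read_views and the directory path are hypothetical, and I have not run this against the asker's files.

```python
def read_views(parquet_dir, columns=("customerId", "productId")):
    """Lazily read a directory of parquet files with Dask cuDF and persist it.

    Assumes a dask.distributed Client backed by a LocalCUDACluster is
    already active, so the persisted partitions are spread over its GPUs.
    """
    # Imported here so that defining the function does not itself need a GPU stack.
    import dask_cudf

    ddf = dask_cudf.read_parquet(parquet_dir, columns=list(columns))
    return ddf.persist()

# Hypothetical usage (the path is a placeholder, not taken from the question):
# ddf = read_views("/path/to/parquet_dir")
```

Selecting only the needed columns at read time keeps the unused randomAttribute column out of GPU memory altogether.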