CUDF error processing a large number of parquet files

Problem Description

I have 2000 parquet files in a directory. Each parquet file is roughly 20MB in size. The compression used is SNAPPY. Each parquet file has rows that look like the following:

+------------+-----------+-----------------+
| customerId | productId | randomAttribute |
+------------+-----------+-----------------+
| ID1        | PRODUCT1  | ATTRIBUTE1      |
| ID2        | PRODUCT2  | ATTRIBUTE2      |
| ID2        | PRODUCT3  | ATTRIBUTE3      |
+------------+-----------+-----------------+

Each column entry is a string. I am using a p3.8xlarge EC2 instance with the following configuration:


  • RAM: 244GB
  • vCPU: 32
  • GPU RAM: 64GB (each GPU has 16GB of RAM)
  • GPUs: 4 Tesla V100

I am trying the following code:

import cudf

def read_all_views(parquet_file_lst):
    # read only the two needed columns from each file, then concatenate on the GPU
    df_lst = []
    for file in parquet_file_lst:
        df = cudf.read_parquet(file, columns=['customerId', 'productId'])
        df_lst.append(df)
    return cudf.concat(df_lst)

This crashes after processing the first 180 files with the following runtime error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 9, in read_all_views
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/cudf/io/parquet.py", line 54, in read_parquet
    use_pandas_metadata,
File "cudf/_lib/parquet.pyx", line 25, in 
cudf._lib.parquet.read_parquet
File "cudf/_lib/parquet.pyx", line 80, in cudf._lib.parquet.read_parquet
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

Only 10% of both the GPU and CPU RAM is utilized at any given time. Any ideas on how to debug this, or what workarounds might help?

Recommended Answer

cuDF is a single-GPU library. 2000 files of 20 MB each is about 40 GB of data, which is more than the 16 GB of memory available on a single V100 GPU.

For workflows that require more than a single GPU, cuDF relies on Dask. The following example illustrates how you could use cuDF + Dask to read data into distributed GPU memory with multiple GPUs in a single node. This doesn't answer your debugging question, but it should hopefully solve your problem.

First, I use a few lines of code to create a Dask cluster with two GPUs.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster() # by default use all GPUs in the node. I have two.
client = Client(cluster)
client
# The print output of client:
# 
# Client
# Scheduler: tcp://127.0.0.1:44764
# Dashboard: http://127.0.0.1:8787/status

# Cluster
# Workers: 2
# Cores: 2
# Memory: 404.27 GB
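
These same few lines should work on the p3.8xlarge in the question: by default LocalCUDACluster starts one worker per GPU it can see, so all four V100s would be used. If you want to restrict the cluster to a subset of GPUs, it also accepts a CUDA_VISIBLE_DEVICES argument; the device indices below are only an illustration, not something from the original answer.

# Sketch only: limit the cluster to two of the node's GPUs.
# CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)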

Next I'll create a couple of parquet files for this example.

import os

import cudf
from cudf.datasets import randomdata

if not os.path.exists('example_output'):
    os.mkdir('example_output')

# write two separate parquet files into the example_output directory
for x in range(2):
    df = randomdata(nrows=10000,
                    dtypes={'a': int, 'b': str, 'c': str, 'd': int},
                    seed=12)
    df.to_parquet('example_output/df_' + str(x) + '.parquet')

Let's look at the memory on each of my GPUs with nvidia-smi.

nvidia-smi
Thu Sep 26 19:13:46 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   51C    P0    29W /  70W |   6836MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |   5750MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Notice the two values. 6836 MB on GPU 0 and 5750 MB on GPU 1 (I happen to have unrelated data already in memory on these GPUs). Now let's read our entire directory of two parquet files with Dask cuDF and then persist it. Persisting it forces computation -- Dask execution is lazy so just calling read_parquet only adds a task to the task graph. ddf is a Dask DataFrame.

ddf = dask_cudf.read_parquet('example_output/*.parquet')
ddf = ddf.persist()

Now let's look at nvidia-smi again.

Thu Sep 26 19:13:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   51C    P0    29W /  70W |   6938MiB / 15079MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |   5852MiB / 15079MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
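
Since ddf is now persisted in GPU memory, later operations run against data that is already on the GPUs. As a small illustration of the lazy-execution point above, the calls below are just one example of forcing a computation on the persisted ddf; they are not part of the original answer.

first_rows = ddf.head()   # computes and returns a small cudf DataFrame
total_rows = len(ddf)     # forces a full row count across both GPUs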

Dask handles distributing our data across both GPUs for us.
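
Applied to the setup in the question, the same pattern would look roughly like the sketch below. Treat it as a sketch only: the directory path is a placeholder, and the final len() call is just one way to trigger a distributed computation.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster()   # one worker per V100 on the p3.8xlarge
client = Client(cluster)

# Read only the two columns you need from every parquet file in the directory.
# '/path/to/parquet_dir' is a placeholder for the real location of the 2000 files.
ddf = dask_cudf.read_parquet('/path/to/parquet_dir/*.parquet',
                             columns=['customerId', 'productId'])
ddf = ddf.persist()            # load the selected columns across the GPUs

print(len(ddf))                # total row count, computed on the GPUs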
