How to read a single large parquet file into multiple partitions using dask/dask-cudf?
Problem description
I am trying to read a single large parquet file (size > gpu_size) using dask_cudf/dask, but it is currently reading it into a single partition, which I am guessing is the expected behavior, inferring from the docstring:
dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', gather_statistics=None, **kwargs):
Read a Parquet file into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition.
It selects the index among the sorted columns if any exist.
Is there a workaround I can use to read it into multiple partitions?
Recommended answer
Parquet datasets can be saved into separate files. Each file may contain separate row groups. Dask DataFrame reads each Parquet row group into a separate partition.
Based on what you're saying, it sounds like your dataset has only a single row group. If that is the case, then unfortunately there is nothing that Dask can really do here.
You might want to go back to the source of the data to see how it was saved, and verify that whatever process writes this dataset is not creating very large row groups.