Create parquet file directory from CSV file in R


Problem Description

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI, but I recently came across arrow and would like to explore it more.

The problem is that the flat files I typically work with are sufficiently large that they cannot be read into R without help. So, I would ideally prefer a way to make the conversion without actually needing to read the dataset into R in the first place.

Any help you can provide would be much appreciated!

Solution

arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
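As a minimal sketch of that workflow, assuming the arrow package is installed: the directory paths, the example data, and the partitioning column `year` below are all hypothetical stand-ins for your own large CSVs.

```r
library(arrow)

# Stand-in for a directory of CSV files too large to read into R directly;
# here we create a tiny example directory so the sketch is self-contained.
csv_dir <- file.path(tempdir(), "csv_data")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(year = c(2019, 2020), value = c(1.5, 2.5)),
          file.path(csv_dir, "data.csv"), row.names = FALSE)

# Open the directory as a Dataset -- this scans metadata but does not
# load the data into memory.
ds <- open_dataset(csv_dir, format = "csv")

# Rewrite the Dataset to Parquet, partitioned by a column
# (one subdirectory per distinct value of `year`).
parquet_dir <- file.path(tempdir(), "parquet_data")
write_dataset(ds, parquet_dir, format = "parquet", partitioning = "year")

list.files(parquet_dir, recursive = TRUE)
```

Because `open_dataset()` returns a lazy Dataset, the CSV-to-Parquet conversion streams through Arrow rather than materializing the full table in R.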

One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
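The move/symlink workaround can be sketched as follows; the file names and directories are hypothetical, and the mixed directory is created here only to make the example self-contained.

```r
library(arrow)

# Hypothetical setup: a directory holding one data CSV plus a non-data file.
mixed_dir <- file.path(tempdir(), "mixed_dir")
dir.create(mixed_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(mixed_dir, "big_file.csv"),
          row.names = FALSE)
writeLines("not data", file.path(mixed_dir, "README.txt"))

# Copy (or symlink, via file.symlink() on Unix-likes) just the data file
# into a fresh directory that contains nothing else.
data_dir <- file.path(tempdir(), "single_csv_dir")
dir.create(data_dir, showWarnings = FALSE)
file.copy(file.path(mixed_dir, "big_file.csv"), data_dir)

# The data-only directory can now be opened as a Dataset.
ds <- open_dataset(data_dir, format = "csv")
```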

