How do I read a Parquet in R and convert it to an R DataFrame?


Problem description

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr

Answer

You can use the arrow package for this. It provides the same functionality as Python's pyarrow, but nowadays it is also packaged for R without requiring Python. As it is not yet available on CRAN, you have to build and install the Arrow C++ library manually first:

git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install

Then you can install the R arrow package:

devtools::install_github("apache/arrow/r")

And use it to load a Parquet file:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#>        x       y
#>    <int>   <dbl>
#> …
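If you want a plain base-R data.frame rather than a tibble, the result of read_parquet() can be converted with as.data.frame(). A minimal round-trip sketch, assuming a recent arrow release (which also provides write_parquet(); the file name somedata.parquet is just an illustration):

```r
library(arrow)

# Write a small data.frame to a Parquet file, then read it back.
df <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))
write_parquet(df, "somedata.parquet")

# read_parquet() returns a tibble by default;
# as.data.frame() converts it to a plain base-R data.frame.
result <- as.data.frame(read_parquet("somedata.parquet"))
stopifnot(identical(result$x, 1:3))
```

Note that the as_tibble argument shown in the answer above belongs to an early version of the package; in later arrow releases the corresponding argument is named as_data_frame.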
