Spark DataFrame to Arrow


Question

I have been using Apache Arrow with Spark in Python for a while and have easily been able to convert between DataFrames and Arrow objects by using Pandas as an intermediary.

Recently, however, I've moved from Python to Scala for interacting with Spark, and using Arrow isn't as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark DataFrame (or RDD, since they're easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I'm wrong, after looking at the Arrow Java docs for a while I couldn't find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark DataFrame into an Arrow object? Perhaps by converting the DataFrame's columns to arrays and then converting those to Arrow objects?

Any help would be much appreciated. Thank you.

Edit: I found the following link, which converts a Parquet schema to an Arrow schema. But it doesn't seem to return an Arrow object from a Parquet file, which is what I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java

Recommended answer

There is not yet a Parquet <-> Arrow converter available as a library in Java. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome a contribution implementing this functionality.

We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner if contributors step up).
