How to use Pandas in Apache Beam?

Question

How can I use pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not well organized. I checked but couldn't find any kind of pandas implementation in Apache Beam. Can anyone point me to the relevant link?

Recommended answer

There's a bit of confusion here.

pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.

It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.

That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
