How to use pandas in Apache Beam?

Question

How can I use pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. The Apache Beam documentation is not well organized either. I looked but could not find any kind of pandas implementation in Apache Beam. Can anyone point me to the relevant link?

Recommended answer

There is some confusion here.

pandas is "supported" in the sense that you can use the pandas library the same way you would without Apache Beam, and the same way you can use any other library from a Beam pipeline, as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default, so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation with pandas on every element: a separate computation for each element, which Beam runs in parallel over all elements.

It is not supported in the sense that Apache Beam currently provides no special integration with it: for example, you can't use a PCollection as a pandas DataFrame, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines); it is just a placeholder node in Beam's execution plan.

That said, a pandas-like API for working with Beam PCollections would certainly be a good idea and would make Beam easier to learn for many existing pandas users, but I don't think anybody is currently working on implementing one. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
