spark sql - whether to use row transformation or UDF

Question

I have an input table (I) with 100 columns and 10 million records. I want to get an output table (O) that has 50 columns, each derived from column(s) of I; that is, there will be 50 functions that map column(s) of I to the 50 columns of O, e.g. o1 = f(i1), o2 = f(i2, i3), ..., o50 = f(i50, i60, i70).

In Spark SQL I can do this in two ways:

  1. A row transformation, where each entire row of I is parsed row by row (e.g. with a map function) to produce a row of O.
  2. UDFs, which I understand work at the column level, i.e. take existing column(s) of I as input and produce one of the corresponding columns of O, so 50 UDF functions would be used.

I want to know which of the above two is more efficient (more distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. it is bulk data processing.
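To make the two options concrete, here is a minimal Scala sketch of both approaches. The column names (i1, i2, i3 standing in for the 100 input columns, o1, o2 for the 50 output columns) and the derivation logic are hypothetical placeholders; the sketch only illustrates the shape of each approach, not the real 50 functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object RowVsUdf {

  // Approach 1: row-level transformation -- each whole row of I is passed
  // through a single function that builds the corresponding row of O.
  def rowTransform(input: DataFrame)(implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    input.map { row =>
      val o1 = row.getAs[Int]("i1") * 2                     // o1 = f(i1)     (placeholder logic)
      val o2 = row.getAs[Int]("i2") + row.getAs[Int]("i3")  // o2 = f(i2, i3) (placeholder logic)
      (o1, o2)                                              // ... and so on up to o50
    }.toDF("o1", "o2")
  }

  // Approach 2: column-level UDFs -- one UDF per output column, applied with select.
  private val f1 = udf((i1: Int) => i1 * 2)
  private val f2 = udf((i2: Int, i3: Int) => i2 + i3)

  def udfTransform(input: DataFrame): DataFrame =
    input.select(
      f1(col("i1")).as("o1"),
      f2(col("i2"), col("i3")).as("o2")
      // ... and so on up to o50
    )
}
```

In the first approach the whole derivation is one opaque function over Row objects; in the second, each output column is a separate expression in the query plan.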

Answer

I was going to write this whole thing about the Catalyst optimizer, but it is simpler just to note what Jacek Laskowski says in his book Mastering Apache Spark 2:

"在恢复使用自己的自定义UDF函数之前,请尽可能在数据集运算符中使用更高级别的基于列的标准函数,因为UDF是Spark的黑盒,因此它甚至不尝试对其进行优化./em>"

"Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them."

Jacek also notes a comment from someone on the Spark development team:

"在一些简单的情况下,我们可以分析UDF字节码并推断出它在做什么,但是通常很难做到."

This is why Spark UDFs should never be your first option.
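To illustrate that advice, here is a small sketch contrasting a custom UDF with an equivalent built-in Column-based function; the column names i1 and o1 are hypothetical placeholders. The lambda inside the UDF is a black box to the Catalyst optimizer, whereas the built-in upper function compiles to a native expression that participates in optimization and whole-stage code generation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf, upper}

// Opaque to Catalyst: the optimizer cannot look inside the lambda,
// so it cannot reason about or rewrite this computation.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

def withUdf(df: DataFrame): DataFrame =
  df.select(upperUdf(col("i1")).as("o1"))

// Transparent to Catalyst: the built-in function is a native expression
// that takes part in optimization and whole-stage code generation.
def withBuiltin(df: DataFrame): DataFrame =
  df.select(upper(col("i1")).as("o1"))
```

Comparing df.explain(true) for the two variants shows the UDF appearing as an opaque UDF node in the plan, while the built-in version appears as upper(i1).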

That same sentiment is echoed in this Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."

However, the author also correctly notes that this may change in the future as Spark gets smarter, and in the meantime you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.
