spark sql - whether to use row transformation or UDF
Question
I have an input table (I) with 100 columns and 10 million records. I want to get an output table (O) with 50 columns, where these columns are derived from the columns of I, i.e. there will be 50 functions that map column(s) of I to the 50 columns of O, i.e. o1 = f(i1), o2 = f(i2, i3), ..., o50 = f(i50, i60, i70).
In Spark SQL I can do this in two ways:
- A row transformation, where entire rows of I are parsed one by one (e.g. with a map function) to produce the rows of O.
- UDFs, which I believe work at the column level, i.e. take existing column(s) of I as input and produce one of the corresponding columns of O, i.e. using 50 UDF functions.
I want to know which of the above two is more efficient (more distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. it is bulk data processing.
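To make the two options concrete, here is a minimal Scala sketch (my own illustration, not from the question): it assumes hypothetical columns i1, i2, i3 and shows only two of the 50 derivations, one done with a whole-row map and one with column-level UDFs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object RowVsUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-vs-udf").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for table I: three of its 100 columns.
    val input: DataFrame = Seq((1, 2.0, 3.0), (4, 5.0, 6.0)).toDF("i1", "i2", "i3")

    // Option 1: row transformation -- map over whole rows with the typed API.
    // Each row is deserialized into a Scala tuple, so the lambda is opaque to Catalyst.
    val byRow = input.as[(Int, Double, Double)]
      .map { case (i1, i2, i3) => (i1 * 2, i2 + i3) }   // o1 = f(i1), o2 = f(i2, i3)
      .toDF("o1", "o2")

    // Option 2: column-level UDFs -- one UDF per derived output column.
    // Each UDF is still a black box, but only its input columns are deserialized.
    val f1 = udf((i1: Int) => i1 * 2)
    val f2 = udf((i2: Double, i3: Double) => i2 + i3)
    val byUdf = input.select(f1(col("i1")).as("o1"), f2(col("i2"), col("i3")).as("o2"))

    byRow.show()
    byUdf.show()
    spark.stop()
  }
}
```

Both run fully distributed over the partitions of I; the practical difference is how much of each row has to be pulled out of Spark's internal format and how much of the computation the optimizer can see.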
Recommended answer
I was going to write this whole thing about the Catalyst optimizer, but it is simpler just to note what Jacek Laskowski says in his book Mastering Apache Spark 2:
"在恢复使用自己的自定义UDF函数之前,请尽可能在数据集运算符中使用更高级别的基于列的标准函数,因为UDF是Spark的黑盒,因此它甚至不尝试对其进行优化./em>"
"Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them."
Jacek also notes a comment from someone on the Spark development team:
"在一些简单的情况下,我们可以分析UDF字节码并推断出它在做什么,但是通常很难做到."
This is why Spark UDFs should never be your first option.
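To illustrate the advice with a sketch of my own (it assumes an active SparkSession named spark and a hypothetical nullable string column i1; it is not taken from the book or the posts cited here): the same cleanup written once as a UDF and once with built-in Column functions that Catalyst can analyze and code-generate.

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit, lower, trim, udf}
import spark.implicits._  // assumes an active SparkSession named `spark`

// Hypothetical single string column standing in for one column of I.
val df = Seq(Some("  Alice "), None, Some("Bob")).toDF("i1")

// UDF version: Catalyst cannot look inside the lambda, so it stays a black box.
val cleanUdf = udf((s: String) => if (s == null) "unknown" else s.trim.toLowerCase)
val viaUdf = df.select(cleanUdf(col("i1")).as("o1"))

// Built-in Column functions: the optimizer sees ordinary string expressions
// and can apply null handling, constant folding and whole-stage codegen.
val viaBuiltins = df.select(coalesce(lower(trim(col("i1"))), lit("unknown")).as("o1"))

viaUdf.explain()       // the UDF appears as an opaque node in the plan
viaBuiltins.explain()  // plain expressions the optimizer understands
```

Comparing the two physical plans with explain() is a quick way to see when a UDF is blocking optimization.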
That same sentiment is echoed in this Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."
However, the author also correctly notes that this may change in the future as Spark gets smarter; in the meantime, you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.