Trying to turn a blob into multiple columns in Spark
Problem description
I have a serialized blob and a function that converts it into a Java Map. I registered the function as a UDF and tried to use it in Spark SQL as follows:
sqlCtx.udf.register("blobToMap", Utils.blobToMap)
val df = sqlCtx.sql(""" SELECT mp['c1'] as c1, mp['c2'] as c2 FROM
(SELECT *, blobToMap(payload) AS mp FROM t1) a """)
This works, but for some reason the very heavy blobToMap function runs twice for every row. In reality I extract 20 fields, and it runs 20 times for every row. I saw the suggestions in "Derive multiple columns from a single column in a Spark DataFrame", but they don't really scale: I don't want to create a class every time I need to extract data.
How can I force Spark to do the reasonable thing? I tried splitting the query into two stages. The only thing that worked was caching the inner select, but that is not feasible either, because the blob is really big and I only need a few dozen fields from it.
Recommended answer
I'll answer my own question in the hope that it helps someone. After dozens of experiments, I was able to force Spark to evaluate the UDF and build the Map once, instead of recomputing it for every key lookup. The fix was to split the query and apply an evil, ugly trick: convert the DataFrame to an RDD and back to a DataFrame:
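// Stage 1: compute the map column once per row.
val df1 = sqlCtx.sql("SELECT *, blobToMap(payload) AS mp FROM t1")
// Round trip through an RDD (note: df1, not df) and register the result as a new table.
sqlCtx.createDataFrame(df1.rdd, df1.schema).registerTempTable("t1_with_mp")
// Stage 2: extract the fields from the already-computed map.
val final_df = sqlCtx.sql("SELECT mp['c1'] as c1, mp['c2'] as c2 FROM t1_with_mp")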
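For anyone who wants to try the round trip on toy data, here is a minimal end-to-end sketch in the same Spark 1.x style as the snippet above. The object name `BlobToColumns`, the `k1=v1;k2=v2` payload format, the single-column schema of `t1`, and the local master are all assumptions made for illustration. The `blobToMap` here is a hypothetical stand-in for the real `Utils.blobToMap`, which the question does not show. The likely reason the trick helps is that a DataFrame rebuilt from an RDD is an opaque leaf to the optimizer, so the second query cannot inline the UDF once per extracted key; treat that as an explanation to check with `explain(true)` rather than a guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BlobToColumns {
  // Hypothetical stand-in for Utils.blobToMap: parses "k1=v1;k2=v2" into a Map.
  def blobToMap(payload: String): Map[String, String] =
    payload.split(";")
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

  def main(args: Array[String]): Unit = {
    // Local context for experimentation only (assumption).
    val conf = new SparkConf().setAppName("blob-to-columns").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._

    // Toy table t1 with a single payload column (assumed schema).
    Seq(Tuple1("c1=a;c2=b"), Tuple1("c1=x;c2=y"))
      .toDF("payload")
      .registerTempTable("t1")

    sqlCtx.udf.register("blobToMap", blobToMap _)

    // Stage 1: compute the map column.
    val df1 = sqlCtx.sql("SELECT *, blobToMap(payload) AS mp FROM t1")

    // Rebuild the DataFrame from its RDD and schema, then register it.
    sqlCtx.createDataFrame(df1.rdd, df1.schema).registerTempTable("t1_with_mp")

    // Stage 2: extract fields from the map column of the rebuilt table.
    val finalDf = sqlCtx.sql("SELECT mp['c1'] AS c1, mp['c2'] AS c2 FROM t1_with_mp")
    finalDf.explain(true) // inspect where blobToMap appears in the plan
    finalDf.show()

    sc.stop()
  }
}
```

Comparing the `explain(true)` output of this version with that of the original single nested query is a quick way to see whether the UDF call is duplicated in the plan on your Spark version.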