How to refer to a broadcast variable in DataFrames
Question
I use Spark 1.6. I am trying to broadcast an RDD and am not sure how to access the broadcast variable in a DataFrame.
I have two DataFrames, employee and department.
Employee Dataframe
------------------
Emp Id | Emp Name | Emp_Age
---------------------------
1      | John     | 25
2      | David    | 35
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
----------------------------
1       | Admin     | 1
2       | HR        | 2
import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept = hiveContext.sql("select * from dept")
val rdd = df_emp.rdd.map(row => (row.getInt(0), row.getString(1)))
val lkp = rdd.collectAsMap()
val bc = sc.broadcast(lkp)
print(bc.value.get(1).get)
// Below statement doesn't work
val combinedDF = df_dept.withColumn("emp_name",bc.value.get($"emp_id").get)
- How do I refer to the broadcast variable in the above combinedDF statement?
- How do I handle it if the lkp doesn't return any value?
- Is there a way to return multiple records from the lkp? (Let's assume there are 2 records for emp_id=1 in the lookup; I would like to get both records.)
- How can I return more than one value from the broadcast (emp_name & emp_age)?
How do I refer to the broadcast variable in the above combinedDF statement?
Use a udf. If emp_id is an Int:
val f = udf((emp_id: Int) => bc.value.get(emp_id))
df_dept.withColumn("emp_name", f($"emp_id"))
How do I handle it if the lkp doesn't return any value?
Don't call .get on the Option as shown above; let the udf return the Option so a missing key becomes null.
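As a sketch (assuming the same bc broadcast of a Map[Int, String] from the question), the udf can return the Option itself, which Spark renders as null for absent keys, or substitute a fallback with getOrElse ("unknown" below is just an illustrative default):

```scala
// Returning the Option directly: Spark emits null when the key is absent
val fOpt = udf((emp_id: Int) => bc.value.get(emp_id))

// Or supply an explicit fallback value instead of null
val fDefault = udf((emp_id: Int) => bc.value.getOrElse(emp_id, "unknown"))

df_dept.withColumn("emp_name", fDefault($"emp_id"))
```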
Is there a way to return multiple records from the lkp?
Use groupByKey:
val lkp = rdd.groupByKey.collectAsMap()
and explode:
df_dept.withColumn("emp_name", f($"emp_id")).withColumn("emp_name", explode($"emp_name"))
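Putting those two steps together, a sketch (reusing the rdd of (emp_id, emp_name) pairs from the question): after groupByKey the broadcast map holds an Iterable per key, so the udf returns a Seq and explode then produces one output row per matching record:

```scala
// lkp now maps emp_id -> Iterable of names (possibly several per key)
val lkp = rdd.groupByKey.collectAsMap()
val bc = sc.broadcast(lkp)

// Return a Seq so Spark sees an array column; empty Seq when the key is missing
val f = udf((emp_id: Int) => bc.value.get(emp_id).map(_.toSeq).getOrElse(Seq.empty[String]))

df_dept
  .withColumn("emp_name", f($"emp_id"))
  .withColumn("emp_name", explode($"emp_name"))
```

Note that explode drops rows whose array is empty, so departments with no matching employee disappear from the result.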
Or just skip all of these steps and use a broadcast join:
import org.apache.spark.sql.functions._
df_emp.join(broadcast(df_dept), Seq("Emp Id"), "left")
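The broadcast join also covers the last question (returning emp_name and emp_age together): the join brings back every column of df_emp, and duplicate emp_ids naturally yield multiple rows. A sketch, assuming the key column is named "Emp Id" in both frames as in the tables above:

```scala
import org.apache.spark.sql.functions.broadcast

// Left join keeps every department row; broadcast() hints Spark
// to ship the smaller df_emp to all executors
val combinedDF = df_dept.join(broadcast(df_emp), Seq("Emp Id"), "left")
// combinedDF now has Dept Id, Dept Name, Emp Name and Emp_Age side by side
```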