How to reference a dataframe when in a UDF on another dataframe?
Question
How do you reference a pyspark dataframe during the execution of a UDF on another dataframe?
Here's a dummy example. I am creating two dataframes, scores and lastnames, and within each lies a column that is the same across the two dataframes. In the UDF applied to scores, I want to filter on lastnames and return the string found in last_name.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sc = SparkContext("local")
sqlCtx = SQLContext(sc)

# Generate random data
import itertools
import random

student_ids = ['student1', 'student2', 'student3']
subjects = ['Math', 'Biology', 'Chemistry', 'Physics']
random.seed(1)

data = []
for (student_id, subject) in itertools.product(student_ids, subjects):
    data.append((student_id, subject, random.randint(0, 100)))

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("subject", StringType(), nullable=False),
    StructField("score", IntegerType(), nullable=False)
])

# Create the scores DataFrame
rdd = sc.parallelize(data)
scores = sqlCtx.createDataFrame(rdd, schema)

# Create the lastnames DataFrame
last_name = ["Granger", "Weasley", "Potter"]
data2 = []
for i in range(len(student_ids)):
    data2.append((student_ids[i], last_name[i]))

schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("last_name", StringType(), nullable=False)
])
rdd = sc.parallelize(data2)
lastnames = sqlCtx.createDataFrame(rdd, schema)

scores.show()
lastnames.show()

from pyspark.sql.functions import udf
def getLastName(sid):
    tmp_df = lastnames.filter(lastnames.student_id == sid)
    return tmp_df.last_name

getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)
And the following is the last part of the trace:
Py4JError: An error occurred while calling o114.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
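The error is not specific to this example: the UDF's closure captures the lastnames DataFrame, and Spark pickles the closure to ship it to executors. A DataFrame wraps a live Py4J gateway object, which cannot be pickled, hence the __getnewargs__ probe in the trace. A minimal sketch of the same failure, using a raw socket as a stand-in for the gateway connection:

```python
import pickle
import socket

# A DataFrame holds a reference to the JVM through a live gateway
# connection; a raw socket plays that role here.
gateway_like = socket.socket()

def my_udf(sid):
    # The closure captures gateway_like, just as getLastName captures lastnames.
    return gateway_like

try:
    pickle.dumps(gateway_like)
except TypeError as e:
    print("cannot serialize:", e)
```

Anything the UDF body references must survive this serialization step, which is why a driver-side DataFrame cannot be used inside one.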
Answer
Change the list of pairs to a dictionary for easy lookup of names:
data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]
Instead of creating an rdd and turning it into a df, create a broadcast variable:
# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)
Now access it in the udf via the value attr on the broadcast variable (lastnames):
from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]
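The dictionary lookup at the heart of this fix can be exercised without a cluster. FakeBroadcast below is a hypothetical stand-in that mimics only the .value attribute of a real pyspark Broadcast object:

```python
student_ids = ['student1', 'student2', 'student3']
last_name = ["Granger", "Weasley", "Potter"]

# Same dictionary the answer builds from the pair list
data2 = {sid: ln for sid, ln in zip(student_ids, last_name)}

class FakeBroadcast:
    """Hypothetical stand-in for pyspark Broadcast: exposes only .value."""
    def __init__(self, value):
        self.value = value

lastnames = FakeBroadcast(data2)

def getLastName(sid):
    # Plain dict lookup against the broadcast value, one call per row
    return lastnames.value[sid]

print(getLastName("student2"))  # Weasley
```

In the real job the function is registered and applied exactly as in the question, with getLastName_udf = udf(getLastName, StringType()) and scores.withColumn("last_name", getLastName_udf("student_id")). The broadcast value is shipped to each executor once and is picklable, so the per-row lookup stays local and the serialization error disappears.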