Curried UDF - Pyspark

Views: 82 Published: 2020/9/4 19:46:28 Tags: python apache-spark pyspark apache-spark-sql user-defined-functions

Problem description

I am trying to implement a UDF in Spark that can take both a literal and a column as arguments. To achieve this, I believe I can use a curried UDF.

The function matches a string literal against each value in a DataFrame column. I have summarized the code below:

def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
    return matching

hc.udf.register("matching", matching)
matching_udf = F.udf(matching, StringType())

df_matched = df.withColumn("matching_score", matching_udf(lit("match_string"))(df.column))

"match_string" actually stands in for a value from a list that I am iterating over.

Unfortunately, this is not working as I had hoped, and I am receiving:

"TypeError: 'Column' object is not callable".

I believe I am not calling this function correctly.

Recommended answer

It should look like this:

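import difflib

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def matching(match_string_1):
    # match_string_1 is a plain Python string captured by the closure.
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(
            a=match_string_1, b=match_string_2).ratio()

    # Here create udf. ratio() returns a float, hence DoubleType.
    return F.udf(matching_inner, DoubleType())

# df.column stands for whichever column holds the strings to compare.
df.withColumn("matching_score", matching("match_string")(df.column))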
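Since the question mentions that the match string comes from a list being iterated over, the same closure idea can be sketched as a loop that adds one score column per list entry. This is only an illustration under assumptions: the helper names make_scorer and add_score_columns, the "score_" column prefix, and the None handling are hypothetical and not part of the original answer. Because make_scorer is called inside the loop, each UDF captures its own match string.

```python
import difflib


def make_scorer(match_string):
    """Return a plain Python function that scores a value against match_string."""
    def score(value):
        # Null column values arrive as None; pass them through as null.
        if value is None:
            return None
        return difflib.SequenceMatcher(a=match_string, b=value).ratio()
    return score


def add_score_columns(df, match_strings, column_name):
    """Add one "score_<match_string>" column per entry (naming is an assumption)."""
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    for match_string in match_strings:
        # Wrap the closure in a udf only after the literal has been bound.
        scorer = F.udf(make_scorer(match_string), DoubleType())
        df = df.withColumn("score_" + match_string, scorer(F.col(column_name)))
    return df
```

A call might look like add_score_columns(df, ["foo", "bar"], "name"), where "name" is a placeholder for the column to compare. Match strings containing dots or spaces would produce awkward column names, so a real implementation would probably sanitize them.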

If you want to support a Column argument for match_string_1, you'll have to rewrite it like this:

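import difflib

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def matching(match_string_1):
    # match_string_1 is now a Column (for example F.lit(...)), not a Python string.
    def matching_inner(match_string_2):
        # Both columns are passed to a two-argument udf and compared row by row.
        return F.udf(
            lambda a, b: difflib.SequenceMatcher(a=a, b=b).ratio(),
            DoubleType())(match_string_1, match_string_2)

    return matching_inner

df.withColumn("matching_score", matching(F.lit("match_string"))(df.column))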

Your current code doesn't work because matching_udf is a UDF, so matching_udf(lit("match_string")) creates a Column expression instead of calling the inner function. A Column cannot be called, hence the TypeError.
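The two-step call that currying relies on can be seen in plain Python, without Spark. The sketch below is a minimal illustration (the variable name scorer is introduced here for clarity): the first call only binds the literal and returns a function, and the second call does the comparison. Wrapping the outer function with F.udf replaces the first step with a call that returns a Column, which is why the second call fails.

```python
import difflib


def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(
            a=match_string_1, b=match_string_2).ratio()
    return matching_inner


# Step 1: supplying the literal returns a function, not a score.
scorer = matching("match_string")
print(callable(scorer))  # True

# Step 2: supplying the second string runs the comparison.
print(scorer("match_string"))  # 1.0, since the strings are identical
```

This is why the udf has to wrap matching_inner (the result of step 1) rather than matching itself.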
