Pyspark: How to deal with null values in python user defined functions


Problem description

I want to use some string similarity functions that are not native to pyspark, such as the jaro and jaro-winkler measures, on dataframes. These are readily available in python modules such as jellyfish. I can write pyspark udf's fine for cases where there are no null values present, i.e. comparing cat to dog. When I apply these udf's to data where null values are present, it doesn't work. In problems such as the one I'm solving, it is very common for one of the strings to be null.

I need help getting my string similarity udf to work in general, or more specifically, to work in cases where one of the values is null.

I wrote a udf that works when there are no null values in the input data:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):

    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    df = (df
          .withColumn('test',
                      jaro_winkler_udf(df[column_left], df[column_right])))

    return df

Example input and output:

+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
+-----------+------------+

+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
+-----------+------------+------------------+

When I run this on data that has a null value, I get the usual reams of spark errors; the most applicable one seems to be TypeError: str argument expected. I assume this is due to null values in the data, since it worked when there were none.
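That TypeError is consistent with jellyfish's similarity functions accepting only strings, so a null (None) reaching the udf fails before any comparison happens. A minimal pure-Python sketch of that failure mode, using a hypothetical stand-in rather than the real C extension:

```python
# Hypothetical stand-in that mimics jellyfish's strict input checking;
# the real jellyfish.jaro_winkler is implemented in C and accepts only str.
def jaro_winkler_standin(s1, s2):
    for s in (s1, s2):
        if not isinstance(s, str):
            raise TypeError("str argument expected")
    # Real metric omitted; placeholder score for illustration only.
    return 1.0 if s1 == s2 else 0.0

try:
    jaro_winkler_standin(None, "dud")
except TypeError as exc:
    print(exc)  # str argument expected
```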

I modified the function above to check if both values are not null and only run the function in that case, otherwise returning 0.

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):

    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    df = (df
       .withColumn('test',
                   F.when(df[column_left].isNotNull() & df[column_right].isNotNull(),
                          jaro_winkler_udf(df[column_left], df[column_right]))
                   .otherwise(0.0)))

    return df

However, I still get the same errors as before.

Sample input and what I would like the output to be:

+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
|       spud|        null|
|       null|        null|
+-----------+------------+

+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
|       spud|        null|               0.0|
|       null|        null|               0.0|
+-----------+------------+------------------+

Answer

We will modify your code a little bit and it should work fine:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import jellyfish


@udf(DoubleType())
def jaro_winkler(s1, s2):
    # Do the null check inside the udf itself: return 0.0 (a float,
    # to match DoubleType) when either input is null.
    if s1 is None or s2 is None:
        return 0.0
    return jellyfish.jaro_winkler(s1, s2)


def jaro_winkler_func(df, column_left, column_right):
    df = df.withColumn(
        'test',
        jaro_winkler(df[column_left], df[column_right])
    )
    return df
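The key change is performing the null check inside the udf itself, so the guard travels with the function no matter how Spark evaluates the expression. The same pattern can be sketched in plain Python, with a hypothetical `null_safe` helper (not part of pyspark or jellyfish) standing in for the udf body:

```python
def null_safe(fn, default=0.0):
    # Wrap a two-argument string function so null (None) inputs
    # return a default score instead of raising TypeError.
    def wrapped(s1, s2):
        if s1 is None or s2 is None:
            return default
        return fn(s1, s2)
    return wrapped

# Stand-in similarity function for demonstration (not the real metric).
same_or_not = null_safe(lambda a, b: 1.0 if a == b else 0.5)

print(same_or_not("dud", "dud"))  # 1.0
print(same_or_not("spud", None))  # 0.0
print(same_or_not(None, None))    # 0.0
```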

