Apply function to each row of Spark DataFrame

Problem description

I'm on Spark 1.3.

I would like to apply a function to each row of a DataFrame. This function hashes each column of the row and returns a list of the hashes.

dataframe.map(row => row.toSeq.map(col => col.hashCode))

I get a NullPointerException when I run this code. I assume that this is related to SPARK-5063.

I can't think of a way to achieve the same result without using a nested map.

Recommended answer

This isn't an instance of SPARK-5063, because you're not nesting RDD transformations: the inner .map() is applied to a Scala Seq, not to an RDD.

My hunch is that some rows in your data set contain null column values, so some of the col.hashCode calls throw a NullPointerException because they end up evaluating null.hashCode. To work around this, you need to take nulls into account when computing the hash codes.
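The failure can be reproduced without Spark at all, which supports the point that the inner map is an ordinary collection operation. The following is a minimal sketch: the object name NullHashRepro and the sample values are made up for illustration, and a plain Seq[Any] stands in for row.toSeq.

```scala
import scala.util.Try

object NullHashRepro {
  // Stand-in for row.toSeq: a plain Scala Seq with one null "column".
  val values: Seq[Any] = Seq("a", null, 42)

  // Same inner map as in the question, wrapped in Try to capture the exception.
  val naive: Try[Seq[Int]] = Try(values.map(col => col.hashCode))

  def main(args: Array[String]): Unit = {
    // Calling hashCode on the null element throws, so the Try is a Failure.
    println(naive.isFailure)
  }
}
```

If the Try fails with a NullPointerException here, the same expression will fail inside dataframe.map for any row that holds a null value.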

If you're running on a Java 7 JVM or higher, you can use java.util.Objects.hashCode, which returns 0 for a null argument:

import java.util.Objects
dataframe.map(row => row.toSeq.map(col => Objects.hashCode(col)))

Alternatively, on earlier versions of Java you can check for null explicitly:

dataframe.map(row => row.toSeq.map(col => if (col == null) 0 else col.hashCode))
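
Both variants can be compared on a plain Scala Seq before running them on a cluster. This is a sketch under assumptions: NullSafeRowHash, viaObjects and viaNullCheck are hypothetical names, not part of Spark, and the Seq[Any] again stands in for row.toSeq.

```scala
import java.util.Objects

object NullSafeRowHash {
  // Hypothetical helper: hash every value, letting Objects.hashCode map null to 0.
  def viaObjects(values: Seq[Any]): Seq[Int] =
    values.map(v => Objects.hashCode(v))

  // Same idea with an explicit null check, for JVMs older than Java 7.
  def viaNullCheck(values: Seq[Any]): Seq[Int] =
    values.map(v => if (v == null) 0 else v.hashCode)

  def main(args: Array[String]): Unit = {
    val values: Seq[Any] = Seq("a", null, 42) // stand-in for row.toSeq
    println(viaObjects(values))   // List(97, 0, 42)
    println(viaNullCheck(values)) // List(97, 0, 42)
  }
}
```

Either function body can be dropped into the dataframe.map call in place of the original inner map. Both treat null as hash code 0, so a null column cannot be told apart from a value whose hash happens to be 0.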
