Apply function to each row of Spark DataFrame
Question
I'm on Spark 1.3.
I would like to apply a function to each row of a DataFrame. This function hashes each column of the row and returns a list of the hashes.
dataframe.map(row => row.toSeq.map(col => col.hashCode))
I get a NullPointerException when I run this code. I assume that this is related to SPARK-5063.
I can't think of a way to achieve the same result without using a nested map.
Answer
This isn't an instance of SPARK-5063, because you're not nesting RDD transformations; the inner .map() is being applied to a Scala Seq, not an RDD.
My hunch is that some rows in your data set contain null column values, so some of the col.hashCode calls throw NullPointerExceptions when you try to evaluate null.hashCode. To work around this, you need to take nulls into account when computing hash codes.
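The failure mode can be reproduced without Spark at all, since the inner map runs on a plain Scala Seq. A minimal sketch (the sample values are hypothetical, not from the question's data):

```scala
// A row's column values as a plain Seq, with one null, mirroring row.toSeq.
val cols: Seq[Any] = Seq("a", 42, null)

// This mirrors the failing code: calling .hashCode on the null element throws.
// cols.map(_.hashCode)  // => NullPointerException

// A null-safe version maps null to 0, the same convention Objects.hashCode uses.
val hashes = cols.map(c => if (c == null) 0 else c.hashCode)
```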
If you're running on a Java 7 JVM or higher (source), you can do
import java.util.Objects
dataframe.map(row => row.toSeq.map(col => Objects.hashCode(col)))
Alternatively, on earlier versions of Java you can do
dataframe.map(row => row.toSeq.map(col => if (col == null) 0 else col.hashCode))
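If you prefer to avoid the explicit null check, an equivalent formulation wraps each value in scala.Option. The safeHash helper below is a hypothetical name, not part of the original answer:

```scala
// Hypothetical helper: null-safe hashCode via Option.
// Option(col) is None when col is null, so getOrElse(0) supplies the same
// default value that java.util.Objects.hashCode returns for null.
def safeHash(col: Any): Int = Option(col).map(_.hashCode).getOrElse(0)

// Applied the same way as the answer's lambda:
// dataframe.map(row => row.toSeq.map(safeHash))
```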