Apache Spark throws NullPointerException when encountering missing feature

Problem description

I have a bizarre issue with PySpark when indexing a column of strings in my features. Here is my tmp.csv file:

x0,x1,x2,x3 
asd2s,1e1e,1.1,0
asd2s,1e1e,0.1,0
,1e3e,1.2,0
bd34t,1e1e,5.1,1
asd2s,1e3e,0.2,0
bd34t,1e2e,4.3,1

where I have one missing value for 'x0'. First I read the features from the csv file into a DataFrame using pyspark_csv (https://github.com/seahboonsiew/pyspark-csv), then index x0 with StringIndexer:

import pyspark_csv as pycsv
from pyspark.ml.feature import StringIndexer

sc.addPyFile('pyspark_csv.py')

features = pycsv.csvToDataFrame(sqlCtx, sc.textFile('tmp.csv'))
indexer = StringIndexer(inputCol='x0', outputCol='x0_idx' )
ind = indexer.fit(features).transform(features)
print ind.collect()

When calling ind.collect(), Spark throws java.lang.NullPointerException. Everything works fine for a complete column such as 'x1', though.

Does anyone have a clue what is causing this and how to fix it?

Thanks in advance!

Sergey

Update:

I use Spark 1.5.1. The exact error:

File "/spark/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 258, in show
print(self._jdf.showString(n))

File "/spark/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__

File "/spark/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o444.showString.
: java.lang.NullPointerException
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:208)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
... etc

I've tried to create the same DataFrame without reading the csv file,

df = sqlContext.createDataFrame(
  [('asd2s','1e1e',1.1,0), ('asd2s','1e1e',0.1,0), 
  (None,'1e3e',1.2,0), ('bd34t','1e1e',5.1,1), 
  ('asd2s','1e3e',0.2,0), ('bd34t','1e2e',4.3,1)],
  ['x0','x1','x2','x3'])

and it gives the same error. A slightly different example works fine,

df = sqlContext.createDataFrame(
  [(0, None, 1.2), (1, '06330986ed', 2.3), 
  (2, 'b7584c2d52', 2.5), (3, None, .8), 
  (4, 'bd17e19b3a', None), (5, '51b5c0f2af', 0.1)],
  ['id', 'x0', 'num'])

After indexing 'x0':

+---+----------+----+------+
| id|        x0| num|x0_idx|
+---+----------+----+------+
|  0|      null| 1.2|   0.0|
|  1|06330986ed| 2.3|   2.0|
|  2|b7584c2d52| 2.5|   4.0|
|  3|      null| 0.8|   0.0|
|  4|bd17e19b3a|null|   1.0|
|  5|51b5c0f2af| 0.1|   3.0|
+---+----------+----+------+

Update 2:

I've just discovered the same issue in Scala, so I guess it's a Spark bug, not PySpark only. In particular, the data frame

val df = sqlContext.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0), 
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1), 
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

throws java.lang.NullPointerException when indexing the 'x0' feature. Moreover, when indexing 'x0' in the following data frame

val df = sqlContext.createDataFrame(
  Seq((0, null, 1.2), (1, "b", 2.3), 
      (2, "c", 2.5), (3, "a", 0.8), 
      (4, "a", null), (5, "c", 0.1))
).toDF("id", "x0", "num")

I get 'java.lang.UnsupportedOperationException: Schema for type Any is not supported', which is caused by the missing 'num' value in the 5th row (presumably the null next to the Double values makes Scala infer the column type as Any, which Spark cannot map to a schema). If one replaces it with a number, everything works well even with the missing value in the 1st row.

I've also tried older versions of Spark (1.4.1), and the result is the same.

Recommended answer

It looks like the module you're using converts empty strings to nulls, and at some point that interferes with downstream processing. At first glance it looks like a PySpark bug.

How to fix it? A simple workaround is to either drop the nulls before indexing:

features.na.drop()
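
For completeness, here is a minimal end-to-end sketch of this first workaround, assuming 'features' was built from tmp.csv as in the question ('features_no_nulls' is just an illustrative name):

from pyspark.ml.feature import StringIndexer

# na.drop() returns a new DataFrame; restricting it to 'x0' keeps rows
# that are only missing values in other columns.
features_no_nulls = features.na.drop(subset=['x0'])

indexer = StringIndexer(inputCol='x0', outputCol='x0_idx')
indexer.fit(features_no_nulls).transform(features_no_nulls).show()

The obvious trade-off is that the rows with the missing 'x0' disappear from the result.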

or replace nulls with some placeholder:

from pyspark.sql.functions import col, when

features.withColumn(
    "x0", when(col("x0").isNull(), "__SOME_PLACEHOLDER__").otherwise(col("x0")))

Moreover, you can use spark-csv. It is efficient, tested, and as a bonus does not convert empty strings to nulls.

features = (sqlContext.read
    .format('com.databricks.spark.csv')
    .option("inferSchema", "true")
    .option("header", "true")
    .load("tmp.csv"))
