pyspark: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema


Problem description

I'm running PySpark SQL code on the Hortonworks sandbox.

18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3

# code 
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()

root
 |-- id: string (nullable = true)
 |-- cat_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- price: string (nullable = true)
 |-- url: string (nullable = true)

df1.show() 
+---+------+--------------------+----+------+--------------------+
| id|cat_id|                name|desc| price|                 url|
+---+------+--------------------+----+------+--------------------+
|  1|     2|Quest Q64 10 FT. ...|    | 59.98|http://images.acm...|
|  2|     2|Under Armour Men'...|    |129.99|http://images.acm...|
|  3|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  4|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  5|     2|Riddell Youth Rev...|    |199.99|http://images.acm...|

# When I try to get counts I get the following error.
df1.count()

Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.

# I get the same error for the following code as well
df1.registerTempTable("products_tab")
# show() prints the rows and returns None, so the result is not assigned
sqlContext.sql("select id, name, desc from products_tab order by name, id").show()

I see that the desc column is null. I am not sure whether a null column needs to be handled differently when creating the DataFrame or when calling methods on it.

The same error occurs when running the SQL query. It seems to be triggered by the "order by" clause: if I remove the order by, the query runs successfully.

Please let me know if you need more information. I would appreciate an answer on how to handle this error.

I checked whether the name field contains any commas, as suggested by Chandan Ray. There are no commas in the name field.

rdd1.count()
=> 1345
rdd2.count()
=> 1345
# clipping id and name column from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]) )
rdd_name.count()
=> 1345
rdd_name_comma = rdd_name.filter(lambda x: "," in x[1])
rdd_name_comma.count()
==> 0
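
Note that once a line has been split on ",", none of the resulting pieces can contain a comma, so the check above reports 0 regardless of the data. A broader diagnostic is to count the pieces each raw line produces. The following is a minimal sketch in plain Python, assuming every valid row has exactly six fields; `EXPECTED_FIELDS`, `field_count`, and the sample lines are hypothetical names and data used only for illustration.

```python
EXPECTED_FIELDS = 6  # assumption: matches the six column names passed to createDataFrame


def field_count(line):
    """Return how many pieces a naive comma split produces for one raw line."""
    return len(line.split(","))


# Hypothetical sample lines standing in for products.csv
sample_lines = [
    '1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,,59.98,http://example.com/a',
    '2,2,"Bag, large",,129.99,http://example.com/b',
]

# Keep only the lines whose piece count differs from the schema width
bad_lines = [line for line in sample_lines if field_count(line) != EXPECTED_FIELDS]
print(bad_lines)

# On the RDD from the question, the same predicate would be applied as
# (not run here):
# rdd1.filter(lambda line: field_count(line) != EXPECTED_FIELDS).take(5)
```

Filtering the raw `rdd1` rather than the already split `rdd2` keeps the offending line intact, which makes the embedded comma easier to spot.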

Recommended answer

I found the issue: it was caused by one bad record in which a comma was embedded in a string field. Even though the string was double-quoted, Python's split(",") ignores the quotes and splits it into 2 columns. I tried using the Databricks spark-csv package:

# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0

# on pyspark
from pyspark.sql.types import *

# explicit schema; the third argument marks each field as nullable
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])

df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
    +---+------+--------------------+----+-----+--------------------+
    | id|cat_id|                name|desc|price|                 url|
    +---+------+--------------------+----+-----+--------------------+
    |  1|     2|Quest Q64 10 FT. ...|    |   60|http://images.acm...|
    |  2|     2|Under Armour Men'...|    |  130|http://images.acm...|
    |  3|     2|Under Armour Men'...|    |   90|http://images.acm...|
    |  4|     2|Under Armour Men'...|    |   90|http://images.acm...|
    |  5|     2|Riddell Youth Rev...|    |  200|http://images.acm...|

df1.printSchema()
    root
     |-- id: integer (nullable = true)
     |-- cat_id: integer (nullable = true)
     |-- name: string (nullable = true)
     |-- desc: string (nullable = true)
     |-- price: decimal(10,0) (nullable = true)
     |-- url: string (nullable = true)

df1.count()
     1345
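
The difference between the naive split and a quote-aware CSV parser can be reproduced without Spark. The sketch below uses Python's standard `csv` module on a hypothetical record (the actual bad record is not shown in the post) to illustrate why one line yields 7 values under `split(",")` but 6 under a CSV parser.

```python
import csv

# Hypothetical record with a double-quoted name field that contains a comma
line = '2,2,"Bag, large",,129.99,http://example.com/b'

naive = line.split(",")            # ignores quoting, so the name is cut in two
parsed = next(csv.reader([line]))  # honors double quotes around a field

print(len(naive), len(parsed))  # prints: 7 6
print(parsed[2])                # prints: Bag, large
```

If the spark-csv package is not available, a similar quote-aware parse could in principle be applied per partition, for example `rdd1.mapPartitions(lambda lines: csv.reader(lines))`. This is an untested assumption for the data in the question, and on Python 2 the `csv` module has limited Unicode support.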
