Spark Sql: TypeError("StructType can not accept object in type %s" % type(obj))


Problem Description


I am currently pulling data from SQL Server using PyODBC and trying to insert it into a Hive table in a near real time (NRT) manner.

I fetched a single row from the source, converted it into a list of strings, and am creating the schema programmatically, but while creating the DataFrame, Spark throws a StructType error.

>>> cnxn = pyodbc.connect(con_string)
>>> aj = cnxn.cursor()
>>>
>>> aj.execute("select * from tjob")
<pyodbc.Cursor object at 0x257b2d0>

>>> row = aj.fetchone()

>>> row
(1127, u'', u'8196660', u'', u'', 0, u'', u'', None, 35, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, u'', 0, None, None)
>>> rowstr = map(str,row)
>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaString = " ".join([row.column_name for row in aj.columns(table='tjob')])

>>> schemaString
u'ID ExternalID Name Description Notes Type Lot SubLot ParentJobID ProductID PlannedStartDateTime PlannedDurationSeconds Capture01 Capture02 Capture03 Capture04 Capture05 Capture06 Capture07 Capture08 Capture09 Capture10 Capture11 Capture12 Capture13 Capture14 Capture15 Capture16 Capture17 Capture18 Capture19 Capture20 User UserState ModifiedDateTime UploadedDateTime'

>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)

>>> [f.dataType for f in schema.fields]
[StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType]

>>> myrdd = sc.parallelize(rowstr)

>>> myrdd.collect()
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
    _verify_type(row, schema)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/types.py", line 1132, in _verify_type
    raise TypeError("StructType can not accept object in type %s" % type(obj))
TypeError: StructType can not accept object in type <type 'str'>

Solution

Here is the reason for the error message:

>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None' ... ]   
#rowstr is a list of str

>>> myrdd = sc.parallelize(rowstr)
#myrdd is an RDD of str

>>> schema = StructType(fields)
#schema is StructType([StringType, StringType, ....])

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
#myrdd should have been an RDD of rows ([str, str, ...] per element) but is an RDD of str
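
The check that fails is purely structural: with a StructType schema, every element of the RDD must itself be a row-shaped object (a tuple, list, or Row) whose fields line up with the schema, never a bare scalar. A minimal sketch of that rule with a hypothetical two-column schema, assuming the same Spark 1.x shell (sc, sqlContext) as in the transcript:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> small = StructType([StructField("a", StringType(), True),
...                     StructField("b", StringType(), True)])
>>> # fails: each element is a bare str, not a two-field row
>>> # sqlContext.createDataFrame(sc.parallelize([u"x", u"y"]), small)
>>> # works: the single element is one row with two fields
>>> sqlContext.createDataFrame(sc.parallelize([(u"x", u"y")]), small).collect()
[Row(a=u'x', b=u'y')]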

To fix that, make an RDD of the proper shape, with one element per row rather than one element per column value:

>>> myrdd = sc.parallelize([rowstr])
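
For completeness, here is a sketch of the corrected flow, reusing rowstr and the schema built in the transcript above (same shell assumed):

>>> myrdd = sc.parallelize([rowstr])   # one element = one complete 36-field row
>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
>>> schemaPeople.count()
1
>>> schemaPeople.first().ID
'1127'

Note that map(str, row) has already turned every SQL NULL into the literal string 'None'; if real NULLs matter downstream, one option is to parallelize the raw row tuple instead (Python None becomes SQL NULL), adjusting the schema types of the non-string columns accordingly.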

