Creating a DataFrame from Row results in 'infer schema issue'


Question

When I began learning PySpark, I used a list to create a DataFrame. Now that inferring the schema from a list has been deprecated, I get a warning that suggests using pyspark.sql.Row instead. However, when I try to create a DataFrame from a Row, I get a schema inference error. This is my code:

>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)

This results in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>

So I created a schema:

>>> schema = StructType([StructField('name', StringType()), 
...                      StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)

But then this error gets thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
    verify_func(obj, schema)
  File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>

Recommended answer

The createDataFrame function takes a list of Rows (among other options), not a single Row, plus the schema, so the correct code would be something like:

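from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: column names and types
schema = StructType([StructField('name', StringType()), StructField('age', IntegerType())])
# The data must be a list of Rows, even when there is only one row
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)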

df.printSchema()
df.show()

Output:

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+-------+---+
|   name|age|
+-------+---+
|Severin| 33|
|   John| 48|
+-------+---+

In the pyspark docs (link) you can find more details about the createDataFrame function.
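
Why the original call fails can be seen without Spark. Row is a subclass of tuple, so when a single Row is passed as the data argument, iterating over it yields its field values (a string and an int) rather than one record, which matches the "type 'int'" in both tracebacks. The sketch below uses a collections.namedtuple as a stand-in for Row so that it runs without a Spark installation. That stand-in is an assumption for illustration, and `Person` and `records_seen` are hypothetical names, not part of PySpark.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass
# (assumption made for illustration only).
Person = namedtuple("Person", ["name", "age"])


def records_seen(data):
    """Return the items a consumer gets when it iterates over `data`
    as if it were a collection of records (hypothetical helper)."""
    return list(data)


row = Person(name="Severin", age=33)

# Passing the bare row: each field value is treated as a separate "record".
bare = records_seen(row)

# Wrapping the row in a list: the consumer sees exactly one record.
wrapped = records_seen([row])

print(bare)
print(wrapped)
```

With PySpark itself, the equivalent fix for a single row is to wrap it in a list, as in spark.createDataFrame([row], schema).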
