PySpark: convert a standard list to a DataFrame
Question
The case is really simple: I need to convert a Python list into a DataFrame with the following code:
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType
schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)
df.show()
It fails with the following error:
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>
Recommended answer
This approach uses less code, does not require building an RDD by hand, and is likely easier to understand (spark here is a SparkSession, available in Spark 2.0 and later):
from pyspark.sql.types import IntegerType
# notice the variable name (more below)
mylist = [1, 2, 3, 4]
# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()
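The error in the question comes from the schema: a StructType describes a row with named fields, so each element has to be row-like (a tuple, list, or Row), not a bare int. If the explicit StructType schema is to be kept, one option is to wrap every value in a 1-tuple first. The following is only a minimal sketch; the helper names as_rows and list_to_df are hypothetical and not part of PySpark, and spark is assumed to be an existing SparkSession.

```python
def as_rows(values):
    """Wrap each scalar in a 1-tuple so it matches a one-field StructType."""
    return [(v,) for v in values]


def list_to_df(spark, values, column="value"):
    """Build a single-column DataFrame with an explicit StructType schema.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    # Imported here so as_rows() can be tried without Spark installed.
    from pyspark.sql.types import IntegerType, StructField, StructType

    schema = StructType([StructField(column, IntegerType(), True)])
    return spark.createDataFrame(as_rows(values), schema)


rows = as_rows([1, 2, 3, 4])
print(rows)  # [(1,), (2,), (3,), (4,)]
```

With a live session, list_to_df(spark, [1, 2, 3, 4]).show() would then display a single column named value.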
NOTE: About naming your variable list: list is a Python builtin (the list type and its constructor), so it is strongly recommended to avoid builtin names as variable names. Reusing one shadows things like list() for the rest of that scope. When prototyping something fast and dirty, many people use a name such as mylist instead.
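As a small illustration of the naming problem, the sketch below shadows list inside a function (shadowing_demo is a hypothetical name used only for this example) and shows that the builtin can no longer be called in that scope:

```python
def shadowing_demo():
    """Show what happens when a variable named `list` hides the builtin."""
    list = [1, 2, 3, 4]  # shadows the builtin inside this function only
    try:
        list("abc")  # now calls the list *instance*, not the builtin type
    except TypeError as exc:
        return str(exc)
    return None


message = shadowing_demo()
print(message)  # 'list' object is not callable
```

Outside the function the builtin is untouched, which is why the same mistake at module level or in a notebook cell is harder to notice and lasts longer.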