Create a dataframe in PySpark that contains a single column of tuples
Question
I have an RDD that contains the following: [('column 1', value), ('column 2', value), ('column 3', value), ..., ('column 100', value)]. I want to create a dataframe that contains a single column of tuples.
The closest I have gotten is:
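from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
schema = StructType([StructField("char", StringType(), False), StructField("count", IntegerType(), False)])
my_udf = udf(lambda w, c: (w, c), schema)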
and then
df.select(my_udf('char', 'int').alias('char_int'))
but this produces a dataframe with a column of lists, not tuples.
Recommended answer
struct is the correct way to represent product types, such as tuple, in Spark SQL, and this is exactly what you get using your code:
df = (sc.parallelize([("a", 1)]).toDF(["char", "int"])
    .select(my_udf("char", "int").alias("pair")))
df.printSchema()
## root
##  |-- pair: struct (nullable = true)
##  |    |-- char: string (nullable = false)
##  |    |-- count: integer (nullable = false)
There is no other way to represent a tuple unless you want to create a UDT (no longer supported in 2.0.0) or store pickled objects as BinaryType.
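To make the BinaryType option concrete, here is a minimal sketch of the pickle round trip it would rely on. It uses only the standard library and runs without Spark. The variable names are illustrative, and the bytearray conversion assumes the binary value comes back to Python as a bytearray.

```python
import pickle

pair = ("a", 1)

# Serialize the tuple to bytes; this is the payload a BinaryType column would hold.
blob = pickle.dumps(pair)

# Assumption: the value read back from the column is a bytearray.
# pickle.loads accepts any bytes-like object, so this works either way.
restored = pickle.loads(bytearray(blob))

print(type(restored).__name__, restored)
```

The trade-off is that Spark SQL sees only opaque bytes, so the fields can no longer be queried or filtered individually the way struct fields can.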
Moreover, struct fields are represented locally as tuple:
isinstance(df.first().pair, tuple)
## True
I guess you may be confused by the square brackets when you call show:
df.show()
## +-----+
## | pair|
## +-----+
## |[a,1]|
## +-----+
These brackets are simply the representation chosen by the JVM counterpart and do not indicate Python types.