Convert StringType to ArrayType in PySpark
Problem Description
I am trying to run the FPGrowth algorithm in PySpark on my dataset.
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
I get the following error:
An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input
column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)
My DataFrame df is in the form:
df.show(2)
+---+---------+--------------------+
| id| name| actor|
+---+---------+--------------------+
| 0|['ab,df']| tom|
| 1|['rs,ce']| brad|
+---+---------+--------------------+
only showing top 2 rows
The FP algorithm works if the data in my "name" column is in the form:
name
[ab,df]
[rs,ce]
How do I convert the column from StringType to ArrayType so that it is in this form?
I formed the DataFrame from my RDD:
rd2=rd.map(lambda x: (x[1], x[0][0] , [x[0][1]]))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=str(p[2]), actor=str(p[1])))
df = spark.createDataFrame(rd3)
rd2.take(2):
[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]
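This mapping is where the type is lost: rd2 holds a genuine Python list, but name=str(p[2]) in the Row constructor converts that list to its printed string representation, which Spark then infers as StringType. A minimal illustration outside Spark:

```python
# str() applied to a Python list yields its printable representation;
# Spark infers StringType for such a value, not ArrayType
value = ['ab,df']        # the list held in rd2
as_string = str(value)   # what name=str(p[2]) stores in the 'name' column
print(as_string)         # "['ab,df']"
```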
Recommended Answer
Split on the comma for each row in the name column of your DataFrame, e.g.
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('array&lt;string&gt;', PandasUDFType.SCALAR)
def split_comma(v):
    # v is a pandas Series of strings like "['ab,df']";
    # strip the surrounding [' and '] and split on the comma
    return v.str.strip("[]'").str.split(',')

df = df.withColumn('name', split_comma(df.name))
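The intended per-element logic, stripping the [' and '] wrapper before splitting, can be checked in plain Python without Spark:

```python
# One cell of the StringType 'name' column, parsed the way the UDF intends
raw = "['ab,df']"
parsed = raw.strip("[]'").split(',')  # drop the wrapper chars, then split
print(parsed)  # ['ab', 'df']
```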
Or better, don't defer this: set name to the list directly when building the RDD.
rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
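Sketching what the corrected mapping produces for one record (the tuple shape ((actor, name_string), id) is assumed from the rd2.take(2) output above):

```python
# A record shaped like rd's elements: ((actor, name_string), id)
record = (('tom', 'ab,df'), 0)
# The corrected lambda: keep the id and actor, split name into a real list
mapped = (record[1], record[0][0], record[0][1].split(','))
print(mapped)  # (0, 'tom', ['ab', 'df'])
```

With name now a genuine Python list, createDataFrame infers ArrayType for the column and FPGrowth accepts it.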