Split Spark Dataframe string column into multiple columns
Question
I've seen various people suggesting that Dataframe.explode
is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it into:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split()
, but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
Answer
pyspark.sql.functions.split()
is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem()
to retrieve each part of the array as a column itself:
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
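One common approach for that general case (not part of the original answer, so treat it as an assumption) is to compute the maximum number of parts across all rows and generate that many getItem() columns, relying on the fact that Column.getItem(i) yields null when i runs past the end of a row's array. The resulting padding behaviour can be sketched outside Spark in plain Python:

```python
def flatten_splits(values, delim="-"):
    """Split each string on delim and pad every row's parts to the maximum
    length with None, mirroring how Column.getItem(i) yields null past the
    end of a Spark ArrayType column."""
    parts = [v.split(delim) for v in values]
    width = max((len(p) for p in parts), default=0)
    return [p + [None] * (width - len(p)) for p in parts]
```

In Spark, the same idea would mean computing the maximum array size once (for example by aggregating over pyspark.sql.functions.size of the split column) and then calling split_col.getItem(i) for each index up to that maximum.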