Split Spark Dataframe string column into multiple columns


Question

I've seen various people suggest that Dataframe.explode is a useful way to do this, but it produces more rows than the original dataframe, which isn't what I want at all. I simply want the Dataframe equivalent of the very simple:

rdd.map(lambda row: row + [row.my_str_col.split('-')])

and take something that looks like this:

col1 | my_str_col
-----+-----------
  18 |  856-yygrm
 201 |  777-psgdg

and convert it to this:

col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am aware of pyspark.sql.functions.split(), but it produces a single nested array column instead of the two top-level columns I want.

Ideally, I want these new columns to be named as well.

Recommended answer

pyspark.sql.functions.split() is the right approach here: you only need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array contains exactly 2 items, that is easy. Use Column.getItem() to retrieve each element of the array as a column of its own:

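import pyspark.sql.functions

# split() yields an ArrayType column; getItem(i) pulls element i out as a top-level column.
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))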

The result will be:

col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am not sure how I would solve this in the general case, where the nested arrays are not the same size from Row to Row.
