How to split a pyspark dataframe column into only two columns (example below)?

Problem description


The column contains the delimiter multiple times in a single row, so a plain split is not straightforward.
In this case, only the first occurrence of the delimiter should be used for the split.


As of now, I am doing this.

However, I feel there should be a better solution.

testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])

testdf.show()

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+

testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

Recommended answer


An approach using a regular expression to split only on the first occurrence of the delimiter. The lookbehind (?<=^[^,]*) matches a comma only when everything before it contains no other comma, i.e. only the first comma in the string splits it.

from pyspark.sql import functions as f

testdf.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
      .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+
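If a newer Spark is available, a possibly simpler variant is to cap the number of splits instead of relying on a lookbehind. The following is a minimal sketch assuming Spark 3.0 or later, where pyspark.sql.functions.split accepts an optional limit argument; the column names simply reuse those from the question.

from pyspark.sql import functions as f

# split into at most two parts: everything before the first comma, and the remainder
two_parts = f.split('Food', ',', limit=2)

testdf.withColumn('Food1', two_parts[0])\
      .withColumn('Food2', two_parts[1]).show()

With limit=2 the second array element keeps the rest of the string (including any further commas) intact, which gives the same output as shown above.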

