How to split a pyspark dataframe column into only two columns (example below)?
Problem description
The column uses the delimiter multiple times in a single row, so split is not as straightforward.
When splitting, only the first occurrence of the delimiter should be considered in this case.
This is what I am doing as of now, but I feel there could be a better solution?
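from pyspark.sql.functions import col, expr, split  # `spark` is assumed to be an existing SparkSession
testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])
testdf.show()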
+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+
# Take the first item, remove it from the original string, then drop the leading comma
testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
    .withColumn("Food2", expr("regexp_replace(Food, Food1, '')"))\
    .withColumn("Food2", expr("substring(Food2, 2)")).show()
+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+
Recommended answer
An approach using a regular expression to split on only the first occurrence of the delimiter
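import pyspark.sql.functions as f
# The lookbehind allows a match only for a comma with no earlier comma before it, i.e. the first one
testdf.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0]).\
    withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()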
+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+
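As a possible alternative to the lookbehind pattern, here is a minimal sketch that assumes Spark 3.0 or later, where `split` accepts an optional `limit` argument. The helper names `split_first` and `add_food_columns` are hypothetical and used only for illustration; `split_first` is a plain-Python reference for the intended result, and `add_food_columns` applies the same idea to a DataFrame with a `Food` column.

```python
from typing import Tuple


def split_first(value: str, sep: str = ",") -> Tuple[str, str]:
    """Plain-Python reference: split on the first separator only."""
    head, _, tail = value.partition(sep)
    return head, tail


def add_food_columns(df):
    """Add Food1/Food2 by splitting Food on the first comma.

    Assumes Spark 3.0+ (split with a limit) and a string column named Food.
    """
    # Imported here so split_first stays usable without Spark installed.
    from pyspark.sql import functions as f

    # limit=2 caps the resulting array at two elements, so everything
    # after the first comma stays together in the second element.
    parts = f.split(f.col("Food"), ",", 2)
    return (
        df.withColumn("Food1", parts.getItem(0))
          .withColumn("Food2", parts.getItem(1))
    )
```

With the `testdf` from the question, `add_food_columns(testdf).show()` should give the same `Food1` and `Food2` values as the table above. Rows that contain no comma at all are not covered by the original example, so their handling is worth checking separately.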