Split string in a spark dataframe column by regular expressions capturing groups


Problem description

Given the data frame below, I want to split the numbers column into an array, with each element holding 3 characters of the original number.

The given data frame:

+---+------------------+
| id|           numbers|
+---+------------------+
|742|         000000000|
|744|            000000|
|746|003000000000000000|
+---+------------------+

The expected data frame:

+---+----------------------------------+
| id|           numbers                |
+---+----------------------------------+
|742| [000, 000, 000]                  |
|744| [000, 000]                       |
|746| [003, 000, 000, 000, 000, 000]   |
+---+----------------------------------+
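As a plain-Python sanity check of the expected output (a sketch only, not Spark code), fixed 3-character chunking with simple slicing produces exactly the arrays shown above:

```python
def chunk3(s: str) -> list:
    # Split a string into consecutive 3-character pieces.
    return [s[i:i + 3] for i in range(0, len(s), 3)]

print(chunk3("000000000"))           # ['000', '000', '000']
print(chunk3("003000000000000000"))  # ['003', '000', '000', '000', '000', '000']
```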

I tried different regular expressions with the split function given below, starting with the regex that I felt should have worked on the very first try:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        [742, '000000000'], 
        [744, '000000'], 
        [746, '003000000000000000'], 
    ],
    ["id", "numbers"]
)

df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))

df.show()

The result is:

+---+--------------+
| id|       numbers|
+---+--------------+
|742|      [, , , ]|
|744|        [, , ]|
|746|[, , , , , , ]|
+---+--------------+

I want to understand what I am doing wrong. Is there a way to set a global flag to get all the matches, or have I missed something in the regular expression altogether?
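The empty strings come from how split interprets the pattern: the regex is treated as a delimiter, so every matched 3-digit run is consumed, and only the (empty) text between matches is returned. Python's re module shows the same behavior (a sketch; Spark actually uses Java regexes, but the two agree here):

```python
import re

s = "003000000000000000"

# re.split consumes each match as a delimiter, so only empty strings
# remain -- exactly what Spark's split() produced above.
print(re.split(r"[0-9]{3}", s))    # ['', '', '', '', '', '', '']

# What we actually want are the matches themselves:
print(re.findall(r"[0-9]{3}", s))  # ['003', '000', '000', '000', '000', '000']
```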

Recommended answer

Here's how you can do this without using a udf:

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)

df.show(truncate=False)
#+---+------------------------------+
#|id |numbers                       |
#+---+------------------------------+
#|742|[000, 000, 000]               |
#|744|[000, 000]                    |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+

First use pyspark.sql.functions.regexp_replace to replace each sequence of 3 digits with that sequence followed by a comma. Then split the resulting string on the commas.

The replacement pattern "$1," means the first capturing group, followed by a comma.

In the match pattern, we also include a negative lookahead for end of string, (?!$), to avoid adding a trailing comma to the end of the string.
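The same replace-then-split trick can be sketched with Python's re module. One difference to note: Spark's regexp_replace uses Java regex replacement syntax ($1), while Python's re.sub writes the backreference as \1.

```python
import re

s = "003000000000000000"

# Append a comma after every 3-digit group except the last one:
# the (?!$) lookahead rejects the match that ends at end-of-string.
with_commas = re.sub(r"([0-9]{3})(?!$)", r"\1,", s)
print(with_commas)             # 003,000,000,000,000,000
print(with_commas.split(","))  # ['003', '000', '000', '000', '000', '000']
```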

Reference: REGEXP_REPLACE capture groups
