How to split using multi-char separator with pipe?


Question

I am trying to split a string column of a Spark DataFrame on the delimiter ":|:|:".

Input:
TEST:|:|:51:|:|:PHT054008056

Test code:

dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))

Result:

+------------------------------+
|splitColumn                   |
+------------------------------+
|[TEST, |, |, 51, |, |, P]     |   
+------------------------------+

Test code:

dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))

part1 and part2 work correctly. part3 only has 2 characters and the rest of the string is truncated.

part3:

P

I want to get the entire part3 string. Any help is appreciated.

Recommended answer

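You're almost there. split treats its second argument as a regular expression, in which an unescaped | means "or", so ":|:|:" matches any single colon rather than the whole delimiter. You just need to escape | within your delimiter. With the pipes escaped, the string splits into three elements, so part3 is at index 2: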

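// Needed outside spark-shell (`spark` is the active SparkSession):
import org.apache.spark.sql.functions.split
import spark.implicits._

val df = Seq(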
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")

df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id|          testcolumn|       part3|
// +---+--------------------+------------+
// |  1|TEST:|:|:51:|:|:P...|PHT054008056|
// |  2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
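
To see the difference between the two patterns without a Spark session, here is a minimal sketch in plain Scala. It relies on the fact that String.split, like Spark's split, interprets the delimiter as a Java regular expression. The object name SplitDemo and its fields are made up for illustration.

```scala
// Plain Scala sketch (no Spark needed): String.split also takes a Java regex.
object SplitDemo {
  val input = "TEST:|:|:51:|:|:PHT054008056"

  // Unescaped: the regex reads as ":" OR ":" OR ":", so every single colon
  // is a separator and the pipes end up as elements of their own.
  val unescaped: Array[String] = input.split(":|:|:")

  // Escaped: "\\|" is a literal pipe, so only the full ":|:|:" sequence matches.
  val escaped: Array[String] = input.split(":\\|:\\|:")

  def main(args: Array[String]): Unit = {
    println(unescaped.length)                 // 7 pieces, including lone "|" elements
    println(escaped.mkString("[", ", ", "]")) // [TEST, 51, PHT054008056]
  }
}
```

This is why the indices change from 0, 3 and 6 in the question to 0, 1 and 2 once the pipe is escaped.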

[UPDATE]

You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate that it is a literal pipe (not the regex "or" operator):

df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show

Note that with triple quotes you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
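
The equivalence of the two spellings can be checked directly in plain Scala, as in the sketch below. It also shows java.util.regex.Pattern.quote, which is not part of the original answer: it is offered here as one possible way to avoid hand-escaping, on the assumption that the delimiter should always be taken literally. The object name DelimiterForms is hypothetical, and only the plain String.split behaviour is demonstrated, not Spark's.

```scala
import java.util.regex.Pattern

object DelimiterForms {
  // Ordinary string literal: the backslash itself must be escaped.
  val escapedLiteral = ":\\|:\\|:"

  // Triple-quoted literal: backslashes are taken as written.
  val rawLiteral = """:\|:\|:"""

  // Alternative (not from the original answer): Pattern.quote marks the
  // whole text as literal, so no manual escaping is needed.
  val quoted: String = Pattern.quote(":|:|:")

  def main(args: Array[String]): Unit = {
    val input = "TEST:|:|:51:|:|:PHT054008056"
    println(escapedLiteral == rawLiteral)   // true: both spell the same regex
    println(input.split(quoted).mkString(",")) // TEST,51,PHT054008056
  }
}
```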
