How to split using a multi-char separator with pipe?
Question
I am trying to split a string column of a dataframe in Spark based on the delimiter ":|:|:".
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly. part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string. Any help is appreciated.
Recommended answer
You're almost there: you just need to escape | within your delimiter, as follows:
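// Assumes a spark-shell session, where a SparkSession named `spark` already exists.
import org.apache.spark.sql.functions.split
import spark.implicits._  // enables toDF and the $"..." column syntax
val df = Seq(
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
// Escape each pipe so it is matched literally instead of acting as regex alternation.
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show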
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
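To see why the escape matters without starting Spark, here is a minimal sketch in plain Scala. It assumes that Spark's split interprets its pattern as a Java regular expression, the same way String.split does. The object name PipeSplitDemo is made up for illustration.

```scala
object PipeSplitDemo {
  val input = "TEST:|:|:51:|:|:PHT054008056"

  // Unescaped: the regex reads as ":" OR ":" OR ":", so every single colon is a split point.
  val unescaped: Array[String] = input.split(":|:|:")

  // Escaped: each pipe is a literal character, so only the full ":|:|:" token is a split point.
  val escaped: Array[String] = input.split(":\\|:\\|:")

  def main(args: Array[String]): Unit = {
    println(unescaped.mkString("[", ", ", "]")) // [TEST, |, |, 51, |, |, PHT054008056]
    println(escaped.mkString("[", ", ", "]"))   // [TEST, 51, PHT054008056]
  }
}
```

With the escaped pattern the array has three elements, which is why the answer reads the third field with getItem(2) rather than getItem(6).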
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate that it is a literal pipe (not the regex "or" operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes you need only a single escape character \, whereas in a regular string literal the backslash is itself an escape character and must be escaped (hence \\).
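As a possible alternative to escaping by hand, java.util.regex.Pattern.quote can build the literal pattern for you. The sketch below uses plain Scala strings. Passing the quoted pattern to Spark's split should behave the same way, on the assumption that Spark hands the pattern to the Java regex engine, but that is not something the original answer covers. The object name QuotedDelimiterDemo is hypothetical.

```scala
import java.util.regex.Pattern

object QuotedDelimiterDemo {
  val delimiter = ":|:|:"

  // Pattern.quote wraps the text in \Q...\E so the regex engine treats every character literally.
  val quoted: String = Pattern.quote(delimiter)

  val parts: Array[String] = "TEST:|:|:51:|:|:PHT054008056".split(quoted)

  // The double-escaped literal and the triple-quoted literal denote the same string.
  val sameString: Boolean = ":\\|:\\|:" == """:\|:\|:"""

  def main(args: Array[String]): Unit = {
    println(parts.mkString("[", ", ", "]")) // [TEST, 51, PHT054008056]
    println(sameString)                     // true
  }
}
```

Quoting is mostly useful when the delimiter comes from configuration or user input, where you cannot know in advance which regex metacharacters it contains.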