Does regexp_extract work for multiple patterns? - Spark SQL
Question
Pattern 1: Delimited by |
Input : a|b|c|d
Output: a|b|c|d
Pick everything when the string is delimited by single pipes only.
Pattern 2: Delimited by | and ||
Example 1:
Input :a|b||c||d
Output:a|b||c
Pick everything before the last double pipe.
Example 2:
Input :a|b||c|d
Output:a|b
Pattern 3: The string can begin with multiple pipes (an odd or even number) and is further delimited by | and ||
Input :|||a|b||c||d
Output:|||a|b||c
Pick everything before the last double pipe. The beginning of the string might have an odd or even number of pipes, and they must be selected.
The query below covers everything except scenario 1. My requirement is to cover all scenarios in one regexp_extract:
spark.sql("select regexp_extract('name|place|thing|ink', '(.*)(?=\\\\|\\\\|)') as demo").show(false)
If it cannot be done in one regexp_extract, can you suggest other options?
Please advise.
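To make the gap in the attempted pattern concrete, here is a minimal sketch that runs it outside Spark with Python's re module. It assumes Python handles this particular pattern the same way as the Java regex engine behind Spark's regexp_extract, and it mimics regexp_extract by returning an empty string when nothing matches. The helper name attempt_extract is hypothetical.

```python
import re

# The pattern from the question, written as a raw Python string.
# The lookahead (?=\|\|) demands that a double pipe follows the captured text.
ATTEMPT = re.compile(r"(.*)(?=\|\|)")


def attempt_extract(text):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = ATTEMPT.search(text)
    return match.group(1) if match else ""


for sample in ["a|b|c|d", "a|b||c||d", "a|b||c|d", "|||a|b||c||d"]:
    print(repr(sample), "->", repr(attempt_extract(sample)))
```

Scenario 1 comes back empty because the input contains no double pipe, so the mandatory lookahead can never succeed. The other three inputs already yield the desired output.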
Recommended answer
Use the following RegEx:
^(\|*(?:(?!\|\|(?!.*\|\|)).)*)
See the RegEx Demo showing all the matches.
This is a rather complicated requirement. It calls for a Tempered Greedy Token with a Negative Lookahead inside the tempering pattern. Let me explain the logic below:
- ^ matches only from the beginning of the string.
- (...) encloses the entire pattern after ^ to make it a capturing group.
- \|* covers the Pattern 3 requirement: it matches the leading | characters, as many as possible (hence the greedy *).
- (?:(?!...).)* is the main construct (skeleton) of the Tempered Greedy Token; its details are explained in the next item.
- \|\|(?!.*\|\|) is the main body (core) of the Tempered Greedy Token. The first part, the \|\| before the inner (?!, ensures that characters are matched up to but not including a double pipe ||. The second part, (?!.*\|\|), ensures that this || is not followed by another double pipe || anywhere later in the string. Together they stop the match only at the last ||, as per the requirement.
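The pattern can be sanity-checked against the four inputs from the question without a Spark session. The sketch below uses Python's re module as a stand-in, on the assumption that it treats these lookaheads the same way as the Java regex engine used by Spark. The names PATTERN, extract and CASES are illustrative only.

```python
import re

# The answer's pattern, unchanged, as a raw Python string.
#   ^                      anchor at the start of the string
#   \|*                    leading pipes, as many as possible
#   (?:(?!\|\|(?!.*\|\|)).)*   consume one character at a time, stopping
#                              only in front of the LAST double pipe
PATTERN = re.compile(r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)")


def extract(text):
    """Return capture group 1, or '' if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else ""


# Inputs and expected outputs copied from the question.
CASES = {
    "a|b|c|d": "a|b|c|d",
    "a|b||c||d": "a|b||c",
    "a|b||c|d": "a|b",
    "|||a|b||c||d": "|||a|b||c",
}

for text, expected in CASES.items():
    result = extract(text)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{text!r} -> {result!r} ({status})")
```

When the pattern is embedded in a Spark SQL string literal, each backslash needs additional escaping, as in the question's own query.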
In fact, I think the question is quite interesting and requires sophisticated RegEx features to support it. This is also the first example I have seen so far that requires a Negative Lookahead within a Tempered Greedy Token construct.