How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?
Problem description
I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.
My question: how do I modify the regular expression to include each term within the OR separated list?
Regular expression
' and | and or | a o | company | co | c o | dba | d b a '
Desired behavior
Replace each found term (including the prefix and suffix spaces) with a single space.
Actual behavior
Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.
Example
Source string
' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '
Result string (desired behavior)
' MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT '
Result string (actual behavior)
' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT '
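The actual behavior above can be reproduced outside SQL Server. The sketch below uses Python's `re` module as a stand-in for VBScript.RegExp, on the assumption that both engines scan left to right for non-overlapping matches. It shows that each match consumes its trailing space, so the term immediately after it has no leading space left to match.

```python
import re

source = (" MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o "
          "company co c o dba d b a COPYRIGHT ")

# The pattern from the question: every term carries its own leading and
# trailing space.
pattern = re.compile(
    r" and | and or | a o | company | co | c o | dba | d b a ",
    re.IGNORECASE,
)

# Matches never overlap. Once a match has consumed its trailing space, the
# next term starts without a leading space and is skipped.
matched = [m.group(0) for m in pattern.finditer(source)]
result = pattern.sub(" ", source)

print(matched)
print(result)
```

Only every other term in the run appears in `matched`, which is why the remaining terms survive the replacement.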
Environment
SQL Server 2005
User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)
Code
set nocount on
declare @source [varchar](800)
declare @regexp [varchar](400)
declare @replace [char](1)
declare @globalReplace [bit]
declare @ignoreCase [bit]
declare @result [varchar](800)
set @globalReplace = 1
set @ignoreCase = 1
SET @source = ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '
set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
set @replace = ' '
select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)
print @result
...produces the result:
MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT
* dbo.regexReplace user-defined function definition *
CREATE FUNCTION [dbo].[regexReplace]
(
@source varchar(5000),
@regexp varchar(1000),
@replace varchar(1000),
@globalReplace bit = 0,
@ignoreCase bit = 0
)
RETURNS varchar(5000) AS -- same size as @result, so long results are not truncated
BEGIN
DECLARE @hr integer
DECLARE @objRegExp integer
DECLARE @result varchar(5000)
EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OADestroy @objRegExp
IF @hr <> 0
BEGIN
RETURN NULL
END
RETURN @result
END
Recommended answer
Try this (the trailing /i marks case-insensitive matching; with VBScript.RegExp, pass only the pattern between the slashes and set IgnoreCase instead):
/(?: (?:and or|and|a o|company|co|c o|dba|d b a))+(?!\S)/i
Like @mathematical.coffee, I started by factoring out the leading space and replacing the trailing space with a lookahead--in this case, a negative lookahead for a non-whitespace character. This way it will work even if the token is the last one in the string and not followed by a space. But the most important change is replacing two or more matches at a time whenever possible.
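As a minimal sketch of the approach described above, the following Python code builds the same shape of pattern from a list of terms, which may help as the list grows. Python's `re` is used here in place of VBScript.RegExp, and `build_pattern` is a hypothetical helper name, not part of the original post. Sorting the terms longest first is an assumption that generalizes the answer's choice of listing "and or" before "and".

```python
import re

def build_pattern(terms):
    """Build a case-insensitive pattern that matches runs of whole terms."""
    # Longest first, so "and or" is tried before its prefix "and".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Each term is preceded by one space and the group repeats, so a run of
    # adjacent terms is one match. The negative lookahead requires a space
    # or the end of the string next, without consuming it.
    return re.compile(r"(?: (?:" + alternation + r"))+(?!\S)", re.IGNORECASE)

terms = ["and", "and or", "a o", "company", "co", "c o", "dba", "d b a"]
source = (" MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o "
          "company co c o dba d b a COPYRIGHT ")

pattern = build_pattern(terms)
# The space after each run is left in place, so the replacement is empty.
result = pattern.sub("", source)
at_end = pattern.sub("", "ACME co")

print(repr(result))
print(repr(at_end))
```

Because the pattern no longer consumes the trailing space, this sketch replaces each match with an empty string. Replacing with a single space, as in the question's `@replace`, would leave two spaces where a run of terms was removed.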