How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?

Problem description

I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.

My question: how do I modify the regular expression to include each term within the OR separated list?

Regular expression

' and | and or | a o | company | co | c o | dba | d b a '

Desired behavior

Replace each found term (including the prefix and suffix spaces) with a single space.

Actual behavior

Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.

Example

Source string

'   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '

Result string (desired behavior)

'   MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT  '

Result string (actual behavior)

'   MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  '
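The skipped terms follow from how global replacement works: matches cannot overlap. Each alternative consumes its trailing space, and that same space is the leading space the next term needs, so the term immediately after a match can never match. The sketch below reproduces the behavior with Python's `re` module. This is an assumption for illustration only: the post uses VBScript.RegExp, though the alternation and non-overlapping semantics shown here are the same, and the names `SOURCE`, `NAIVE`, `naive_result`, and `matched` are made up for the example.

```python
import re

# Source string and pattern copied from the question.
SOURCE = "   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  "
NAIVE = " and | and or | a o | company | co | c o | dba | d b a "

# Global, case-insensitive replace of every match with a single space.
naive_result = re.sub(NAIVE, " ", SOURCE, flags=re.IGNORECASE)

# The terms that actually matched. Each one swallows the space that the
# following term would have needed as its prefix.
matched = [m.group(0) for m in re.finditer(NAIVE, SOURCE, flags=re.IGNORECASE)]

print(repr(naive_result))
print(matched)
```

Only ' DBA ', ' and ', ' a o ', ' co ' and ' dba ' match; the term directly after each of them is left in place, which gives the "every other term" result above.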

Environment

SQL Server 2005

User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)

Code

set nocount on

declare @source [varchar](800)
declare @regexp [varchar](400)
declare @replace [char](1)
declare @globalReplace [bit]
declare @ignoreCase [bit]
declare @result [varchar](800)

set @globalReplace = 1
set @ignoreCase = 1

SET @source = '   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '
set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
set @replace = ' '

select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)

print @result

...produces the result:

   MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  

* dbo.regexReplace user-defined function definition *

CREATE FUNCTION [dbo].[regexReplace]
    (
    @source varchar(5000),
    @regexp varchar(1000),
    @replace varchar(1000),
    @globalReplace bit = 0,
    @ignoreCase bit = 0
    )
    RETURNS varchar(5000) AS -- same width as @source and @result, so long results are not truncated
        BEGIN
            DECLARE @hr integer
            DECLARE @objRegExp integer
            DECLARE @result varchar(5000)

            EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END

            EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END

            EXECUTE @hr = sp_OADestroy @objRegExp
                IF @hr <> 0 
                    BEGIN
                        RETURN NULL
                    END

            RETURN @result
        END

Recommended answer

Try this:

(?: (?:and or|and|a o|company|co|c o|dba|d b a))+(?!\S)

Run it with case-insensitive matching enabled. In slash-delimited regex notation that is the trailing /i flag, which is not part of the pattern itself; with VBScript.RegExp, set IgnoreCase (here, @ignoreCase = 1) instead.

Like @mathematical.coffee, I started by factoring out the leading space and replacing the trailing space with a lookahead, in this case a negative lookahead for a non-whitespace character. This way it works even if the token is the last one in the string and is not followed by a space. But the most important change is that the + lets a single match consume a run of two or more consecutive terms whenever possible.
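As a rough check of the approach described in this answer, here is a minimal sketch in Python's `re` module. That is an assumption, since the post targets VBScript.RegExp, but the non-capturing groups and negative lookahead used here behave the same way in both engines as far as I know. `build_pattern` is a hypothetical helper, not part of the original answer; it builds the alternation from a list so that new terms can be appended without editing the pattern by hand. The replacement is assumed to be an empty string, because the match no longer includes the trailing space; replacing with a single space would leave a doubled space.

```python
import re

SOURCE = "   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  "
TERMS = ["and", "and or", "a o", "company", "co", "c o", "dba", "d b a"]


def build_pattern(terms):
    """Hypothetical helper: build the answer's pattern from a list of terms."""
    # Longest terms first, so "and or" is tried before "and".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # One or more (space + term) groups, then a lookahead that requires
    # whitespace or end of string without consuming it.
    return re.compile(r"(?: (?:" + alternation + r"))+(?!\S)", re.IGNORECASE)


PATTERN = build_pattern(TERMS)

# Assumption: replace with "" because the trailing space is not consumed.
fixed_result = PATTERN.sub("", SOURCE)
print(repr(fixed_result))
```

Because the lookahead rejects a term that is only the start of a longer word, a string such as "ACME CORP" is left unchanged, while "ACME CO" loses its final term even though no space follows it.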
