从杂音字符串中提取电话号码 [英] Extract phone number from noised string

查看:67
本文介绍了从杂音字符串中提取电话号码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在表中有一列,其中包含随机数据以及不同格式的电话号码.该列可能包含

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain

  • 名称
  • 电话
  • 电子邮件
  • HTML标记
  • 地址(带有数字)

示例:

1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 
2) John Smith
3) xxx@yyy.com
4) John Smith 8 999 888 77 77

电话号码的写法还取决于.可能像 892410100 22,8(927)410-00-22,+7(927)410-00-22,+7(927)410-00-22,(927)410 0022,927 410 00 22,9(2741)0 0 0-22

How a phone number is written is also depends. It may be like 8 927 410 00 22, 8(927)410-00-22, +7(927)410-00-22, +7 (927) 410-00-22, (927)410 00 22, 927 410 00 22, 9(2741) 0 0 0-22 and so on

这里的通用规则是电话号码格式包含10-11位数字.

The common rule here is that the phone number format contains 10-11 digits.

我最好的猜测是使用正则表达式,首先从字符串中删除电子邮件地址(因为它们可以在其中包含电话号码,例如79990001122@gmail.com),然后基于知道它是10或10,使用一些正则表达式提取电话行中的11位数字用,(,),+,-等字符分隔(我不认为有人会使用.作为电话数字分隔符,所以我们不不想在第一个示例中想到 77.106.46.202 这样的IP地址.

My best guess is to use regular expressions and firstly remove email addresses (since they can contain phone numbers in them like 79990001122@gmail.com) from the string and then use some regular expression to extract phone based on knowing it's 10 or 11 digits in row delimited with characters like ,(,),+,- and so on (I don't think someone would use . as phone digit delimiter so we don't want to think of IP Addresses like 77.106.46.202 in the first sample).

所以问题是如何从这些值中获取电话号码.

So the question is how to get phone numbers from these values.

我想从上面的三个例子中得到的最终值是:

The final values I want to get from the three examples above are:

1) 79005346546 79005346546 79005346546 
2) 
3) 
4) 89998887777

服务器为 Microsoft SQL Server 2014-12.0.2000.8(X64)标准版(64位)

推荐答案

更新(20200226)

有一些评论认为CLR/regex解决方案可能比我发布的ngram8k解决方案更快.我已经听了六年了,但是每一次,测试工具都无一例外地讲述了一个不同的故事.我已经在前面的注释说明中发布了使Microsoft©MDQ系列的CLR Regex在短短几分钟内即可运行的信息.它们是由Microsoft开发,测试和调整的,并随Master Data Services/Data Quality Services一起提供.我已经使用了很多年了,它们很好.

UPDATED (20200226)

There were a couple comments that a CLR/regex solution could be faster than the ngram8k solution I posted. I've heard this for six years but every single time, without exception, the test harness tells a different story. I already posted in the earlier comments instructions to get the Microsoft© MDQ family of CLR Regex running in just a few minutes. They were developed, tested and tuned by Microsoft and ship with Master Data Services/Data Quality Services. I've used them for years, they're good.

很明显,您不希望在 WHERE 子句中使用函数,但是由于Regex是生锈的AF,因此我需要这样做.为了公平起见,我在N-Gram解决方案的WHERE子句中对DigitsOnlyEE进行了相同的操作.

Obviously you don't want functions in your WHEREclause but, since my Regex is rusty AF, I needed to. To level the playing field I did the same with DigitsOnlyEE in the N-Gram solution's WHERE clause.

SET NOCOUNT ON;
DBCC FREEPROCCACHE    WITH NO_INFOMSGS;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS TIME ON;

DECLARE
  @newData BIT            = 0,
  @string  VARCHAR(8000)  = '1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 ',
  @pattern VARCHAR(50)    = '[^0-9()+.-]',
  @srchLen INT            = 11;

IF @newData = 1
BEGIN
  IF OBJECT_ID('tempdb..#strings','U') IS NOT NULL DROP TABLE #strings;

  SELECT 
    StringId = IDENTITY(INT,1,1),
    String   = REPLICATE(@string,ABS(CHECKSUM(NEWID())%3)+1)
  INTO   #strings
  FROM   dbo.rangeAB(1,1000000,1,1) AS r;
END

PRINT CHAR(10)+'Regex/CLR version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT regex.NewString
FROM   #strings AS s
CROSS APPLY
(
  SELECT STRING_AGG(clr.RegexReplace(f.Token,'[^0-9]','',0),' ')
  FROM   clr.RegexSplit(s.string,@pattern,N'[0-9()+.-]',0) AS f
  WHERE  f.IsValid = 1
  AND    LEN(clr.RegexReplace(f.Token,'[^0-9]','',0)) = @srchLen
) AS regex(NewString);

PRINT CHAR(10)+'NGrams version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT ngramsStuff.NewString
FROM   #strings AS s
CROSS APPLY
(
  SELECT      STRING_AGG(ee.digitsOnly,' ')
  FROM        samd.patExtract8K(@string,@pattern) AS pe
  CROSS APPLY samd.digitsOnlyEE(pe.item)          AS ee
  WHERE       LEN(ee.digitsOnly) = @srchLen
) AS ngramsStuff(NewString)
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
GO

测试结果

Regex/CLR version Serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 19918 ms,  elapsed time = 12355 ms.

NGrams version Serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 844 ms,  elapsed time = 971 ms.

NGrams8k非常快,不需要您编译新的程序集,学习新的编程语言,启用CLR函数等.垃圾回收没有问题.甚至MDS/DQS附带的CLR N-GRAM功能也无法达到NGrams8k的性能(请参阅我文章下方的评论).

NGrams8k is very fast, does not require you to compile a new assembly, learn a new programming language, Enable CLR functions, etc... No issues with garbage collection. Even the CLR N-GRAMs function that ships with MDS/DQS can't touch NGrams8k for performance (see the comments under my article).

首先获取 ngrams8k的副本并使用它来构建PatExtract8k(本文底部的DDL.)接下来进行快速热身:

First grab a copy of ngrams8k and use it to build PatExtract8k (DDL below at the bottom of this post.) Next a quick warm-up:

DECLARE
  @string  VARCHAR(8000)  = 'Call me later at 222-3333 or tomorrow at 312.555.2222, 
                             (313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
  @pattern VARCHAR(50)    = '%[^0-9()+.-]%';


SELECT pe.itemNumber, pe.itemIndex, pe.itemLength, pe.item
FROM   samd.patExtract8K(@string,@pattern) AS pe
WHERE  pe.itemLength > 1;

返回:

ItemNumber  ItemIndex   ItemLength  Item
----------- ----------- ----------- ----------------
1           18          8           222-3333
2           42          12          312.555.2222
3           91          13          (313)555-6789
4           112         14          1+800-555-4444

请注意,该函数返回匹配的模式,字符串中的位置,项目长度.可以利用前三个属性进行进一步处理,这将我们带到您的帖子中.注意我的评论:

Note that the function returns the matched pattern, position in the string, Item Length and the item. The first three attributes can be leveraged for further processing which brings us to your post. Note my comments:

-- First for some easily consumable sample data. 
DECLARE @things TABLE (StringId INT IDENTITY, String VARCHAR(8000));
INSERT @things (String)
VALUES
('Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 '),
('John Smith'),
('xxx@yyy.com'),
('John Smith 8 999 888 77 77');

DECLARE @SrchLen INT = 11;

SELECT
  StringId   = t.StringId, 
  ItemIndex  = pe.itemIndex,
  ItemLength = @SrchLen,
  Item       = i2.Item
FROM        @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]')                        AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),11)))              AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ',''))))      AS i2(Item)
WHERE       pe.itemLength >= @SrchLen;

返回:

StringId    ItemIndex            ItemLength  Item
----------- -------------------- ----------- -----------
1           17                   11          79005346546
1           62                   11          79005346546
1           221                  11          79005346546
4           11                   11          89998887777

接下来,我们可以像这样处理外部行,并按如下方式处理行到列的连接:

Next we can handle outer rows like so and row-to-column concatenation like this:

WITH t AS
(
  SELECT      i2.Item, t.StringId
  FROM        @things AS t
  CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]')                        AS pe
  CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
  CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),11)))              AS ns(NewString)
  CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ',''))))      AS i2(Item)
  WHERE       pe.itemLength >= @SrchLen
)
SELECT 
  StringId  = t2.StringId,
  NewString = ISNULL((
    SELECT t.item+' '
    FROM   t
    WHERE  t.StringId = t2.StringId
    FOR XML PATH('')),'')
FROM      @things AS t2
LEFT JOIN t       AS t1 ON t2.StringId = t1.StringId
GROUP BY  t2.StringId;

返回:

StringId  NewString
--------- --------------------------------------
1         79005346546 79005346546 79005346546 
2         
3         
4         89998887777 

我希望我有更多时间了解其他细节,但这花了比原计划更长的时间.欢迎任何疑问.

I wish I had a little more time for additional details but this took a little longer then planned. Any questions welcome.

Patextract:

CREATE FUNCTION samd.patExtract8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50)
)
/*****************************************************************************************
[Description]:
 This can be considered a T-SQL inline table valued function (iTVF) equivalent of 
 Microsoft's mdq.RegexExtract except that:

 1. It includes each matching substring's position in the string

 2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
    instead of nvarchar(4000) for the pattern

 3. The mask parameter is not required and therefore does not exist.

 4. You have specify what text we're searching for as an exclusion; e.g. for numeric 
    characters you should search for '[^0-9]' instead of '[0-9]'. 

 5. There is is no parameter for naming a "capture group". Using the variable below, both 
    the following queries will return the same result:

     DECLARE @string nvarchar(4000) = N'123 Main Street';

   SELECT item FROM samd.patExtract8K(@string, '[^0-9]');
   SELECT clr.RegexExtract(@string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);

 Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
 http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
 rows where [matched]=0. The key benefit of is that it performs substantially better 
 because you are only returning the number of rows required instead of returning twice as
 many rows then filtering out half of them.  Furthermore, because we're 

 The following two sets of queries return the same result:

 DECLARE @string varchar(100) = 'xx123xx555xx999';
 BEGIN
 -- QUERY #1
 -- patExtract8K
   SELECT ps.itemNumber, ps.item 
   FROM samd.patExtract8K(@string, '[^0-9]') ps;

   -- patternSplitCM   
   SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item 
   FROM dbo.patternSplitCM(@string, '[^0-9]') ps
   WHERE [matched] = 0;

 -- QUERY #2
   SELECT ps.itemNumber, ps.item 
   FROM samd.patExtract8K(@string, '[0-9]') ps;

   SELECT itemNumber = row_number() over (order by itemNumber), item 
   FROM dbo.patternSplitCM(@string, '[0-9]')
   WHERE [matched] = 0;
 END;

[Compatibility]:
 SQL Server 2008+

[Syntax]:
--===== Autonomous
 SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
 FROM samd.patExtract8K(@string,@pattern) pe;

--===== Against a table using APPLY
 SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
 FROM samd.SomeTable t
 CROSS APPLY samd.patExtract8K(t.someString, @pattern) pe;

[Parameters]:
 @string        = varchar(8000); the input string
 @searchString  = varchar(50); pattern to search for

[Returns]:
 itemNumber = bigint; the instance or ordinal position of the matched substring
 itemIndex  = bigint; the location of the matched substring inside the input string
 itemLength = int; the length of the matched substring
 item       = varchar(8000); the returned text

[Developer Notes]:
 1. Requires NGrams8k

 2. patExtract8K does not return any rows on NULL or empty strings. Consider using 
    OUTER APPLY or append the function with the code below to force the function to return 
    a row on emply or NULL inputs:

    UNION ALL SELECT 1, 0, NULL, @string WHERE nullif(@string,'') IS NULL;

 3. patExtract8K is not case sensitive; use a case sensitive collation for 
    case-sensitive comparisons

 4. patExtract8K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

 5. patExtract8K performs substantially better with a parallel execution plan, often
    2-3 times faster. For queries that leverage patextract8K that are not getting a 
    parallel exeution plan you should consider performance testing using Traceflag 8649 
    in Development environments and Adam Machanic's make_parallel in production. 

[Examples]:
--===== (1) Basic extact all groups of numbers:
  WITH temp(id, txt) as
 (
   SELECT * FROM (values
   (1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
   (2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
   (3, '&**OOOOO=+ + + // ==?76543// and today !!222222\\\tester{}))22222444 done'))t(x,xx)
 )
 SELECT
   [temp.id] = t.id,
   pe.itemNumber,
   pe.itemIndex,
   pe.itemLength,
   pe.item
 FROM        temp AS t
 CROSS APPLY samd.patExtract8K(t.txt, '[^0-9]') AS pe;
-----------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20170801 - Initial Development - Alan Burstein
 Rev 01 - 20180619 - Complete re-write   - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
       itemIndex  = f.position,
       itemLength = itemLen.l,
       item       = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
 SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
 FROM   samd.NGrams8k(@string, 1) AS ng
 WHERE  PATINDEX(@pattern, ng.token) <  --<< this token does NOT match the pattern
        ABS(SIGN(ng.position-1)-1) +    --<< are you the first row?  OR
        PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
  DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
GO

这篇关于从杂音字符串中提取电话号码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆