Extract phone number regex

Question

How can I extract phone numbers from a text file?

x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",  
  "7164347566"
)

This question has been answered for other languages (see PHP and general regex), but it does not seem to have been tackled on Stack Overflow for R.

I have searched and found what appear to be usable regexes for finding phone numbers (in addition to the regexes from the other languages above): http://regexlib.com/Search.aspx?k=phone but I have not been able to use them with gsub in R to extract all of the numbers in the example.

Ideally, we'd get something like:

[[1]]
[1] "2-613-213-4567" "5555555555"    

[[2]]
[1] "613 213 4567"

[[3]]
[1] "(613)2134567"

[[4]]
[1] "1.575.555.5555"

[[5]]
[1] "7164347566"

Solution

This is the best I've been able to do. You have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, parentheses, hyphens or spaces, bounded by a space or by the start or end of the string":

library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")

Output:

[[1]]
[1] " 2-613-213-4567 " " 5555555555 "    

[[2]]
[1] " 613 213 4567"

[[3]]
[1] " (613)2134567"

[[4]]
[1] "1.575.555.5555 "

[[5]]
[1] "7164347566"

The leading/trailing spaces could probably be avoided with a more complex regex, or you could just trim them afterwards.
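As a minimal sketch of cleaning up the matches afterwards, assuming the same `x` as in the question and using `str_trim()` from stringr (the names `loose_pattern` and `trimmed` are only illustrative):

```r
library(stringr)

# Sample input copied from the question
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",
  "7164347566"
)

# Same general pattern as above; each match may carry a boundary space
loose_pattern <- "(^| )[0-9.() -]{5,}( |$)"

# str_extract_all() returns a list with one character vector per input,
# so str_trim() is applied to each element to drop the surrounding whitespace
trimmed <- lapply(str_extract_all(x, loose_pattern), str_trim)
trimmed
```

On the sample input this should give the numbers without the surrounding spaces, which is the shape the question asks for.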

Update: a bit of searching led me to this answer, which I slightly modified to allow periods. It is a bit stricter in that it requires something shaped like a valid (US?) phone number, but it seems to cover all your examples:

str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")

Output:

[[1]]
[1] "613-213-4567" "5555555555"  

[[2]]
[1] "613 213 4567"

[[3]]
[1] "(613)2134567"

[[4]]
[1] "575.555.5555"

[[5]]
[1] "7164347566"
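Note that this pattern drops the leading `2-` and `1.` that appear in the question's desired output. If those prefixes matter, one possible tweak is to allow an optional single digit plus separator in front. This is an assumption for illustration, not part of the answer the pattern was adapted from, and the name `prefixed_pattern` is hypothetical:

```r
library(stringr)

# Sample input copied from the question
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",
  "7164347566"
)

# Hypothetical variation: "(\\d[ .-]?)?" optionally matches one leading digit
# (for example a country code) followed by an optional space, period or hyphen.
# The remainder is the stricter pattern shown above, unchanged.
prefixed_pattern <- "(\\d[ .-]?)?\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}"

prefixed <- str_extract_all(x, prefixed_pattern)
prefixed
```

On the sample input this should keep `2-613-213-4567` and `1.575.555.5555` whole. The trade-off is that any stray digit directly in front of a number would be swallowed into the match as well, so it is looser than the pattern it extends.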

The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:

huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
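A usage sketch for the long pattern, assuming the same `x` as in the question. `str_extract_all()` returns the full matches, while `str_match()` also exposes the capture groups (area code with or without parentheses, exchange, line number, extension) for the first match in each string. The names `all_matches` and `first_match` are illustrative only:

```r
library(stringr)

# Sample input copied from the question
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",
  "7164347566"
)

# The long pattern from above, with the ^ and $ anchors already removed
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"

# Every full match per input string, returned as a list of character vectors
all_matches <- str_extract_all(x, huge_regex)

# First match per string as a matrix: column 1 is the full match,
# columns 2 to 6 are the five capture groups (NA where a group did not take part)
first_match <- str_match(x, huge_regex)

all_matches
first_match
```

The capture groups are the main reason to reach for this pattern: they let you normalise numbers to one format instead of only pulling them out as raw text.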
