How to extract expressions matching an email address from a text file using R or the command line?


Question

I have a text file that contains email addresses and some other information.

I would like to know how I can extract those email addresses using R or the terminal.

I've read that I can use a regular expression that matches an email address, such as

"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$" 

But what command or function should I use to extract those emails?

There is no pattern in the text file. The command or function should just search the document and extract the email addresses.

Recommended answer

Let's take an unstructured example file:

this is a test

fred is fred@foo.com and joe is joe@example.com - but
 @this is a twitter handle for twit@here.com

If you do this:

> myText <- readLines("testmail.txt")
> emails = unlist(regmatches(myText, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", myText)))
> emails
[1] "fred@foo.com"    "joe@example.com" "twit@here.com"  

it extracts a vector of all the emails, including when there is more than one on a line. I don't think it will find email addresses broken across line breaks, but if you paste the lines you read together it might do that too:

> myText = paste(readLines("testmail.txt"),collapse=" ")
> emails = regmatches(myText, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", myText))
> emails
[[1]]
[1] "fred@foo.com"    "joe@example.com" "twit@here.com"  

In this case myText holds a single string, because we pasted all the lines together, so the returned list emails has only one element.
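To make the difference between the two calls concrete, here is a small sketch that uses an in-memory character vector in place of readLines("testmail.txt"). The names lines_in, email_re and broken are mine, not from the answer, and the "broken" input is made up. It also checks one assumption about line breaks: an address split over two lines is only rejoined when the lines are pasted with collapse = "", since collapse = " " puts a space in the middle of it.

```r
# Same pattern as in the answer; backslashes are doubled inside an R string.
email_re <- "([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))"

# Stand-in for readLines("testmail.txt").
lines_in <- c("this is a test",
              "",
              "fred is fred@foo.com and joe is joe@example.com - but",
              " @this is a twitter handle for twit@here.com")

# regmatches() returns one list element per input line;
# lines without a match give character(0).
per_line <- regmatches(lines_in, gregexpr(email_re, lines_in))
lengths(per_line)            # 0 0 2 1
emails <- unlist(per_line)   # flat character vector of all matches

# Hypothetical input with an address broken over a line break.
broken <- c("write to fred@foo.", "com for details")
with_space <- paste(broken, collapse = " ")   # "... fred@foo. com ..."
no_space   <- paste(broken, collapse = "")    # "... fred@foo.com ..."

found_with_space <- unlist(regmatches(with_space, gregexpr(email_re, with_space)))
found_no_space   <- unlist(regmatches(no_space, gregexpr(email_re, no_space)))
```

Here found_with_space is empty and found_no_space contains "fred@foo.com". Note that collapse = "" can also glue an address at the end of one line to the first word of the next, so neither choice is safe in general.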

Note that this regex is not a strict definition of a valid email address. For example, it only accepts between 2 and 4 letters after the last dot, so it doesn't match the whole of fred@foo.fnord. There are top-level domains longer than four characters, so you may need to modify the regex.

Also, the name part only allows lowercase letters, digits, underscores, hyphens and dots, so valid addresses such as foo+bar@google.com won't match.
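One detail worth checking, sketched here on the two example addresses: because the pattern passed to gregexpr has no ^ and $ anchors, these addresses are not skipped outright. As far as I can tell, the matcher returns the longest piece that still fits the pattern, so the result is a truncated address, which is arguably worse than no match at all. The names email_re, tests and partial are mine.

```r
# Unanchored pattern from the answer (backslashes doubled for an R string).
email_re <- "([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))"

tests <- c("fred@foo.fnord", "foo+bar@google.com")

# Extract whatever part of each string the pattern accepts.
partial <- unlist(regmatches(tests, gregexpr(email_re, tests)))
partial
# "fred@foo.fnor" "bar@google.com"
```

The first address loses its final "d" because of the {2,4} limit, and the second loses "foo+" because "+" is not in the character class for the name part.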

A regex that fixes these two issues might be:

 "([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,14}))"

but it probably has other issues, and you'd be better off searching online for a better email address regex. I say better, because a perfect one doesn't exist...
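Putting the pieces together, a minimal sketch of a helper built on the revised pattern might look like this. The function name extract_emails and the ignore.case = TRUE argument are my additions (the pattern only lists lowercase letters, so mixed-case addresses would otherwise be cut short or missed), and the sample text is made up.

```r
# Hypothetical helper: returns all substrings of `text` that fit the revised pattern.
extract_emails <- function(text) {
  email_re <- "([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,14}))"
  # gregexpr() finds every match per element; regmatches() pulls the text out;
  # unlist() flattens the per-element list into one character vector.
  unlist(regmatches(text, gregexpr(email_re, text, ignore.case = TRUE)))
}

sample_text <- c("fred@foo.fnord wrote to foo+bar@google.com",
                 "no address here",
                 "Reply to Joe.Bloggs@Example.COM please")

found <- extract_emails(sample_text)
found
# "fred@foo.fnord" "foo+bar@google.com" "Joe.Bloggs@Example.COM"
```

With a real file, the same helper could be called as extract_emails(readLines("testmail.txt")). It inherits all the remaining weaknesses of the pattern, so it is a starting point rather than a validator.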
