Parsing "From" addresses from email text


Problem description


I'm trying to extract email addresses from plain-text transcripts of emails. I've cobbled together a bit of code to find the addresses themselves, but I don't know how to make it discriminate between them; right now it just spits out every email address in the file. I'd like it to output only addresses that are preceded by "From:" and a few wildcard characters and that end with ">" (because the emails are set up as From [name]<[email]>).

Here's the code now:

import re  # allows the program to use regular expressions

foundemail = []  # an empty list that will collect the matches

# I don't currently know the exact meaning of this expression, but I assume
# it means something like "[stuff]@[stuff][1-4 letters]"
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# "line" is set to a single line read from the file ("text.txt")
for line in open("text.txt"):
    # extend the list with every address that mailsrch finds on this line
    foundemail.extend(mailsrch.findall(line))

print(foundemail)

Solution

I'd do it by expanding the regular expression you're using to include the extra text you want to match. So first, let me explain what that regex does:

[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}

  • [\w\-] matches any "word" character (letter, number, or underscore), or a hyphen
  • [\w\-\.]+ matches (any word character or hyphen or period) one or more times
  • @ matches a literal '@'
  • [\w\-] matches a word character or hyphen
  • [\w\-\.]+ matches one or more word characters, hyphens, and/or periods
  • [a-zA-Z]{1,4} matches 1, 2, 3, or 4 lowercase or uppercase letters
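
As a quick check of the breakdown above, here is a minimal sketch that runs the original pattern against a single made-up line (the sample text and both addresses are hypothetical). It illustrates why the unmodified regex returns every address it sees, not just the sender's:

```python
import re

# The pattern from the question, unchanged.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Hypothetical sample line; names and addresses are invented for illustration.
sample = "From: Alice Smith <alice@example.com> To: Bob <bob@example.org>"

# findall returns every non-overlapping match, sender and recipient alike.
all_addresses = mailsrch.findall(sample)
print(all_addresses)
```

Both the "From" and the "To" address come back, which is the behavior the question describes.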

Now, to modify this for your purposes, let's add regex parts to match "From", the name, and the angle brackets:

From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>

  • From: matches the literal text "From: "
  • [\w\s]+? matches one or more consecutive word characters or whitespace characters. The question mark makes the match non-greedy, so it will match as few characters as possible while still allowing the whole regular expression to match. In this case it is not strictly necessary: the thing that comes immediately afterwards, <, is neither a word character nor whitespace, so the greedy and non-greedy forms stop at the same place.
  • < matches a literal less-than sign (opening angle bracket)
  • The same regular expression you had before is now surrounded by parentheses. This makes it a capturing group, so you can call m.group(1) to get the text matched by that part of the regex.
  • > matches a literal greater-than sign
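
To make the role of the capturing group concrete, here is a small sketch using the extended pattern on a made-up line (the sample text is hypothetical). It compares the whole match with the text captured by the parentheses:

```python
import re

# The extended pattern: literal "From: ", a name, then the address in angle brackets.
pattern = re.compile(
    r'From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>'
)

# Hypothetical sample line, invented for illustration.
sample = "From: Alice Smith <alice@example.com> To: Bob <bob@example.org>"

m = pattern.search(sample)
if m:
    whole_match = m.group(0)  # everything the regex matched, including "From: " and the brackets
    address = m.group(1)      # only the part inside the parentheses
    print(whole_match)
    print(address)
```

Note that the name part only allows word characters and whitespace, so a sender name containing punctuation such as a period or quotation marks would not match this pattern.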

Since the regex now uses capturing groups, your code will need to change a little as well:

import re
foundemail = []

mailsrch = re.compile(r'From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>')

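# open the file in a with-block so that it is closed when the loop finishes
with open("text.txt") as f:
    for line in f:
        # keep only the captured address from each "From: ..." match
        foundemail.extend([m.group(1) for m in mailsrch.finditer(line)])

print(foundemail)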

The code [m.group(1) for m in mailsrch.finditer(line)] produces a list out of the first capturing group (remember, that was the part in parentheses) from each match found by the regular expression.
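
For comparison, when a pattern contains exactly one capturing group, re.findall returns the text of that group for each match, so it could stand in for the list comprehension here. A minimal sketch, using a hypothetical in-memory list of lines in place of text.txt:

```python
import re

mailsrch = re.compile(
    r'From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>'
)

# Hypothetical lines standing in for the contents of text.txt.
lines = [
    "From: Alice Smith <alice@example.com>",
    "To: Bob <bob@example.org>",
    "From: Carol <carol@example.net>",
]

via_finditer = []
via_findall = []
for line in lines:
    # Explicitly pull group 1 out of each match object.
    via_finditer.extend([m.group(1) for m in mailsrch.finditer(line)])
    # With a single capturing group, findall yields that group's text directly.
    via_findall.extend(mailsrch.findall(line))

print(via_finditer)
print(via_findall)
```

Both lists hold the same addresses; the finditer form is the more explicit of the two and keeps working unchanged if more groups are added to the pattern later.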
