How to extract an email address from a PDF
Problem description
I'm trying to extract the email address from a CV using pdfminer and regular expressions:
from io import StringIO
from pdfminer3.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage
import re

def get_cv_email(self, cv_path):
    # An empty set of page numbers means "process every page"
    pagenums = set()
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # Render each page of the PDF into the text buffer
    infile = open(cv_path, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()

    # Take the first email-like string found in the extracted text
    match = re.search(r'[\w\.-]+@[\w\.-]+', text)
    email = match.group(0)
    return email
The email is successfully extracted for most of the resumes, but it doesn't work correctly every time.
Example: jayantanathcdh@gmail.comEducationalQualification (the text that follows the address in the PDF ends up in the match)
UPDATE: How can I edit my regex to ignore whatever comes after the email if it starts with an uppercase letter?
Recommended answer
Based on your last comment, to keep matching the email as before but stop at the first uppercase letter after the @, you can restrict the part after the @ to lowercase letters, digits, dots and hyphens with this regex:
[\w\.-]+@[a-z0-9\.-]+
For example:
import re
text = "jayantanathcdh@gmail.comEducationalQualification"
match = re.search(r'[\w\.-]+@[a-z0-9\.-]+', text)
email = match.group(0)
print(email)
# prints: jayantanathcdh@gmail.com
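The snippet above assumes a match always exists. `re.search` returns `None` when nothing matches, so `match.group(0)` would raise an `AttributeError` on a CV with no address in it. A minimal sketch of a more defensive helper follows. The name `extract_email` and the trailing-punctuation cleanup are illustrative assumptions, not part of the original answer:

```python
import re

# The local part may be mixed case; the domain is limited to lowercase
# letters, digits, dots and hyphens, so a capitalised word glued to the
# address ends the match.
EMAIL_RE = re.compile(r'[\w.-]+@[a-z0-9.-]+')

def extract_email(text):
    """Return the first email-like match in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    if match is None:
        return None
    # Drop a trailing '.' or '-' picked up from sentence punctuation.
    return match.group(0).rstrip('.-')

print(extract_email("jayantanathcdh@gmail.comEducationalQualification"))
print(extract_email("no address here"))
```

This approach has two limits. A domain written with uppercase letters (for example `Gmail.com`) would be cut short, and lowercase text glued to the address would still be included in the match.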