How to extract email from PDF


Question

I'm trying to extract an email address from a CV using pdfminer and regular expressions:

from io import StringIO
from pdfminer3.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage
import re

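def get_cv_email(self, cv_path):
    # Convert the PDF to plain text.
    pagenums = set()  # an empty set makes get_pages() yield every page
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The context manager closes the file even if a page fails to parse.
    with open(cv_path, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    # Take the first thing in the text that looks like an email address.
    match = re.search(r'[\w\.-]+@[\w\.-]+', text)
    # re.search() returns None when nothing matches, so guard before .group().
    return match.group(0) if match else None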

The email is successfully extracted for most of the resumes, but it doesn't always work correctly.

Example of a bad result: jayantanathcdh@gmail.comEducationalQualification

UPDATE: How can I edit my regex so that it ignores whatever comes after the email address when that text starts with an uppercase letter?

Recommended answer

Based on your last comment: to match the email as before, but stop at the first uppercase letter after the @, you can use this regex:

[\w\.-]+@[a-z0-9\.-]+

For example:

import re
text = "jayantanathcdh@gmail.comEducationalQualification"
match = re.search(r'[\w\.-]+@[a-z0-9\.-]+', text)
email = match.group(0)

print(email)
#jayantanathcdh@gmail.com
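
To see why the change helps, the two patterns can be run side by side. The sketch below is only an illustration: the helper name first_email and the constants ORIGINAL and LOWERCASE_DOMAIN are made up for this example and are not part of the question or of pdfminer. It also returns None instead of raising when no address is found.

```python
import re

# Pattern from the question: \w also matches uppercase letters, so the
# domain part keeps going into the text that follows the address.
ORIGINAL = re.compile(r'[\w\.-]+@[\w\.-]+')
# Pattern from the answer: the domain part stops at the first uppercase letter.
LOWERCASE_DOMAIN = re.compile(r'[\w\.-]+@[a-z0-9\.-]+')


def first_email(text, pattern=LOWERCASE_DOMAIN):
    """Return the first match of pattern in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


sample = "jayantanathcdh@gmail.comEducationalQualification"
print(first_email(sample, ORIGINAL))   # the whole string, heading included
print(first_email(sample))             # jayantanathcdh@gmail.com
print(first_email("no address here"))  # None
```

One trade-off to keep in mind: because the domain part now accepts only lowercase letters, digits, dots and hyphens, an address whose domain starts with an uppercase letter (for example John@Example.com) is not matched at all, and a heading that starts with a lowercase letter would still be swallowed.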
