Get only body of letters, emails from text files

Question

I want to remove all the From, To, Cc, Subject and Sent headers from this text document and keep only the body of each mail, so that I can summarize the content of the document. What is the best way to do this in Python? I think it is better to do the extraction first and then the preprocessing. My code is attached below. The payload and is_multipart part is not done properly; that is where my doubt lies, so I have marked that part with comments and need help there.

The code and the .txt file are attached below for reference.

import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords

# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                    # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
            #writer.writerow([file, summary, kw])
    except Exception as e:
        pass

Text file

 Stephanie /ANN

From: Mr.A,  <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE:  Holdings: XXXX SPA – mfno.1322

Dear Dr. Tim A. , 

The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other 
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal 
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any 
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.  



Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 



Currently, there is no requirement to submit or resubmit NAs in any electronic format.  However, starting May 5, 2018, 
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common 
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal 
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A.gov/abc/bca 


This communication is an informal communication consistent with which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 

From: Tim.@xxxx.com [mailto:Tim.@xxxx.com]  
Sent: Wednesday, July 25, 2018 2:10 PM 
To: Mr.A,  <.Mr.A@abc.com> 
Cc: May.Abd@xxxx.com 
Subject: RE: Holdings: XXXX SPA ‐ dm 013383 

Dear , 


XXXX



2

Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does 
direct bANNiness for test  S intermediate with b. and not with the other companies (e, 
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to 
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a 
separate QA  S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as 
described below: 

Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced 
to our NA 13383. 

Option 2: We can do a single QA for  and mention that they can cross‐reference any of their NAs. This 
would allow them to cross‐reference any of their 

If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know. 

If not, when you issue your request, can you please send to me and May Abd by email? 

Kind regards. 

Tim 

Tim A. , BsC 
Director, YY SERVICES) 
Xxxx ANN 
Phone/FAX: 2312333 
Cell: 23312123131 
Email: tim.@xxxx.com 



From: , Tim /ANN  
Sent: Monday, July 23, 2018 7:05 AM 
To: 'Mr.A, ' 
Cc: Abd, May /ANN 
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383 

Dear , 

May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this 
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience. 

Kind regards. 

Tim 

Tim A. , MSC 
Director, PQR 
Xxxx 
Phone/FAX: 2312313313 
Cell: 3142342424 
Email: tim.@xxxx.com 



XXXX



3


‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐ 
From: "Mr.A, " <.Mr.A@abc.com> 
Date: Jul 20, 2018 9:01 AM 
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383 
To: "TRETE/ANN" <May.Abd@xxxx.com> 
Cc: "mno.com> 

Dear May Abd, 

. I need to talk to you on this.  

Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 


Currently, there is no requirement to submit or resubmit NAs in any electronic format.   
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A./cder/NA   


This communication is an informal communication  which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 


XXXX

Recommended answer

It's not really clear which part of the code you need help with, what you want it to do instead of what it currently does, or how to pass on the results for further processing correctly.

However, I will note that your code has a number of problems.

  • You cannot read an email message as UTF-8 text. Regardless of the file extension, an RFC822 message is simply a sequence of bytes. Traditional email could come in a large number of different encodings, and if you try to coerce it into UTF-8, you will run into UnicodeDecodeErrors and other snags.
  • As always, a blanket except Exception: is a major bug. Perhaps you only put this in for debugging, but it actually makes debugging harder.
  • Typical modern email messages come with somewhat complex MIME body structures which you have to analyze in context before you decide which one(s) you actually want to process. One common phenomenon is multipart/alternative where the same message is rendered in different formats so that recipients can decide whether they want to read it rendered as HTML, plain text, or, occasionally, perhaps PDF or RTF or a single image or whatever, depending on the application. Also, HTML structures often have multiple parts, because the main HTML wants to pull in small images which are supplied in the MIME structure as well (company logo, animated emojis, and other insults to the reader). Perhaps see also What are the "parts" in a multipart email?
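
To make the multipart structure concrete, here is a minimal sketch that builds a small multipart/alternative message, serializes it to bytes, parses it back, and lists the content type of every part. The addresses and body text are made up for illustration; a real message will usually have more parts than this.

```python
from email import message_from_bytes
from email.message import EmailMessage

# Build a small multipart/alternative message so there is something to parse.
# All header values and body text here are invented for the demo.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Demo"
msg.set_content("Plain text version of the body.")
msg.add_alternative("<p>HTML version of the body.</p>", subtype="html")

# On disk or on the wire, a message is just a sequence of bytes.
raw = msg.as_bytes()

# Parse the bytes and walk the MIME tree: the container comes first,
# followed by each part it holds.
parsed = message_from_bytes(raw)
content_types = [part.get_content_type() for part in parsed.walk()]
print(content_types)
```

The first entry is the container itself and the remaining entries are the alternative renderings of the same body, which is why you have to decide which one to process.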

Another complication for this answer is that Python's email library went through an overhaul relatively recently. The new functionality was introduced experimentally in Python 3.3, but only became the documented, preferred API in 3.6; the parser functions still use the legacy behavior unless you explicitly pass a policy. Most of the code you will find out in the wild uses the pre-3.6 facilities, but going forward, you will probably want to target the new and improved API.

With the legacy API your code might look something like

# glob, os, dirs, summarize and keywords come from the question's code
from email import message_from_binary_file

for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = message_from_binary_file(file)
    main_part = None
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    if main_part is None:
        # No text/plain part was found, so there is nothing to summarize
        continue
    summary = summarize(main_part, ratio=0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)
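
One caveat with the legacy API: get_payload() without arguments returns the payload with its content transfer encoding (quoted-printable, base64) still applied. The following sketch, using a made-up message, shows one way to get readable text: ask for the decoded bytes and then decode them with the charset the part declares. Falling back to UTF-8 when no charset is declared is an assumption of this example, not something the library prescribes.

```python
from email import message_from_bytes

# A made-up single-part message with a quoted-printable Latin-1 body.
raw = (
    b"From: sender@example.com\n"
    b"Subject: Demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Caf=E9 au lait\n"
)

msg = message_from_bytes(raw)  # legacy API: an email.message.Message

# Without decode=True the quoted-printable escapes are still in place.
undecoded = msg.get_payload()

# decode=True undoes the transfer encoding and returns bytes.
payload_bytes = msg.get_payload(decode=True)

# Turn the bytes into text using the declared charset (assumed fallback: utf-8).
charset = msg.get_content_charset() or "utf-8"
text = payload_bytes.decode(charset, errors="replace")
print(text)
```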

To use the new 3.6+ API you will need to adapt this to something like

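from email import message_from_binary_file
from email.policy import default as default_email_policy
...
    b = message_from_binary_file(file, policy=default_email_policy)
    # get_body() returns a message part (or None), not a string;
    # call .get_content() on a text part to get its decoded text
    main_part = b.get_body(['related', 'plain', 'html'])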

This will result in a new email.message.EmailMessage object which has some different methods and different behaviors than the legacy email.message.Message class. The documentation suggests that maybe one day the default policy will be passed in by default, at which point old code will switch to new behavior (but also probably some amount of unpleasant surprises and outright breakage).
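
As a small illustration of how the two classes differ, the sketch below parses the same made-up bytes twice, once without a policy and once with the default policy, and compares the results. The sample message is invented for the demo.

```python
from email import message_from_bytes
from email.message import EmailMessage, Message
from email.policy import default as default_email_policy

# A made-up message whose Subject uses an RFC 2047 encoded word.
raw = b"From: a@example.com\nSubject: =?utf-8?q?Caf=C3=A9?=\n\nHello\n"

legacy = message_from_bytes(raw)                               # no policy given
modern = message_from_bytes(raw, policy=default_email_policy)  # new API

print(type(legacy).__name__, type(modern).__name__)

# The legacy object hands back the header as it appears in the source;
# the new object decodes it for you.
legacy_subject = legacy["Subject"]
modern_subject = str(modern["Subject"])
print(legacy_subject, "|", modern_subject)

# Convenience methods such as get_body() exist only on the new class.
print(hasattr(legacy, "get_body"), hasattr(modern, "get_body"))
```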

Notice also the get_body() method, which is new in 3.6 and which lets you easily pick out a "probable main part"; though if no text/plain part is available, the code above will fall back to HTML, which you will then need to process further to extract the actual text (look at BeautifulSoup maybe?)
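
Here is one possible sketch of that fallback. It prefers text/plain, falls back to HTML, and strips the tags with the standard library's html.parser instead of BeautifulSoup, purely to keep the example dependency-free. The helper names (body_text, html_to_text) are hypothetical, and the tag stripping is deliberately crude: it keeps the contents of script and style elements, for instance.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default as default_email_policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Hypothetical helper: crude HTML-to-text conversion."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


def body_text(msg):
    """Hypothetical helper: return the probable main body as plain text."""
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()
    if part.get_content_subtype() == "html":
        return html_to_text(content)
    return content


# Demo with a made-up HTML-only message.
demo = EmailMessage()
demo["Subject"] = "Demo"
demo.set_content(
    "<html><body><p>Hello <b>world</b></p></body></html>", subtype="html"
)
parsed = message_from_bytes(demo.as_bytes(), policy=default_email_policy)
text = body_text(parsed)
print(text)
```

Asking only for "plain" and "html" keeps get_body() from returning a multipart/related container, which has no text content of its own.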

There is no technical, robust, reliable way to separate boilerplate (headers, signatures, etc) from actual content in email. Some HTML email clients might provide hints in the generated message as to which <div> contains things the user typed in, but in the general case, you just have to wade up to your eyebrows in (frankly, hopeless) heuristics.
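
For a plain-text dump like the one in the question, which is not a real RFC822 message, one such heuristic is simply to drop lines that look like header fields. The sketch below is an assumption-laden illustration, not a robust solution: the label list and the strip_header_lines name are made up, and any body line that happens to start with one of these labels will be dropped too.

```python
import re

# Hypothetical heuristic: a line starting with one of these labels and a
# colon is treated as a header line. Adjust the list to your data.
HEADER_LINE = re.compile(
    r"^\s*(From|Sent|Date|To|Cc|Bcc|Subject):", re.IGNORECASE
)


def strip_header_lines(text):
    """Return text without the lines that look like mail header fields."""
    kept = [line for line in text.splitlines() if not HEADER_LINE.match(line)]
    return "\n".join(kept)


# A shortened excerpt in the style of the question's text file.
sample = """From: Mr.A,  <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Subject: [EXTERNAL] RE:  Holdings

Dear Dr. Tim A. ,

The option-2 is fine.
"""

body = strip_header_lines(sample)
print(body)
```

Signatures and legal disclaimers would need further heuristics of their own, with the same caveats.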
