Extract body from raw email with regex


Question


--047d7b33d6decd251504bfe78895
Content-Type: multipart/alternative; boundary=047d7b33d6decd250d04bfe78893

--047d7b33d6decd250d04bfe78893
Content-Type: text/plain; charset=UTF-8

twest

ini sebuah proiduct abru

awdawdawdawdwa

aw
awdawdaw

--047d7b33d6decd250d04bfe78893
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div class=3D"gmail_quote">twest=C2=A0<div><br></div><div>ini sebuah proidu=
ct abru</div><div><br></div><div>awdawdawdawdwa</div><div><br></div><div>aw=
</div><div>awdawdaw</div>
</div><br>

--047d7b33d6decd250d04bfe78893--

  1. How can I get the text/plain and the text/html content of the mail with regex?
  2. Does an email have only one content body, consisting of a text/html part and a text/plain part?

Here is a snippet of what I'm currently doing; it doesn't work correctly.

    $parts = explode('--', $this->rawemail);
    $this->headers = imap_rfc822_parse_headers($this->rawemail);
    # var_dump($parts);
    # Process the parts
    foreach ($parts as $part) 
    {
        # Get Content text/plain
        if (preg_match('/Content-Type: text\/plain;/', $part)) 
        {
            $body_parts = preg_split('/\n\n/', $part);

            # If Above the newline (Headers)
            if ($body_parts[0]) 
            {
                # var_dump($body_parts[0]);
            }

            # If Below the newline (Data)
            if ($body_parts[1]) 
            {
                var_dump($body_parts[1]);
            }
        }

        # Get Content text/html
        if (preg_match('/Content-Type: text\/html;/', $part)) 
        {
            $body_parts = preg_split('/\n\n/', $part);

            # If Above the newline (Headers)
            if ($body_parts[0]) 
            {
                # var_dump($body_parts[0]);
            }

            # If Below the newline (Data)
            if ($body_parts[1]) 
            {
                var_dump($body_parts[1]);
            }
        }
    }

Solution

I think you'd be better off going through the email one line at a time, as it's the line breaks that matter most in the structure of an e-mail.

Your rules would be:

  • If you get a double line break, then the body is starting, and it is plain text (as there are no headers to indicate otherwise).
  • Otherwise, carry on until you get to the "boundary=" bit, then record the boundary and hop into a "looking for boundary" mode.
  • Then, when you find a boundary, hop into "looking for Content-Type or double new-line" mode, and look for Content-Type (and note the content type) or a double new-line (the headers have finished, and the body follows until the next boundary).
  • While reading the body of the message, you're back in "looking for boundary" mode to repeat the process.
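
The rules above can be sketched as a small state machine in PHP. This is only a rough illustration, not a complete MIME parser: the function name `extractBodies()` and the mode names are made up for this example, and it assumes headers are not folded in awkward places and that the word "boundary=" only appears in the Content-Type header.

```php
<?php
// Illustrative sketch: extractBodies() and its mode names are assumptions
// made for this example, not part of PHP or any library.
function extractBodies(string $raw): array
{
    $boundary = null;        // first boundary found; we stick with it
    $mode     = 'headers';   // headers | seek | part_headers | body
    $type     = 'text/plain';
    $body     = [];
    $parts    = [];

    foreach (preg_split('/\r\n|\r|\n/', $raw) as $line) {
        // A boundary line closes the current body and opens the next part.
        if ($boundary !== null && ($line === "--$boundary" || $line === "--$boundary--")) {
            if ($mode === 'body') {
                $parts[] = ['type' => $type, 'body' => trim(implode("\n", $body))];
            }
            $body = [];
            $type = 'text/plain';
            $mode = ($line === "--$boundary--") ? 'seek' : 'part_headers';
            continue;
        }

        if ($mode === 'body') {
            $body[] = $line;
        } elseif ($mode === 'headers' || $mode === 'part_headers') {
            if ($boundary === null && preg_match('/boundary\s*=\s*"?([^";\s]+)"?/i', $line, $m)) {
                $boundary = $m[1];   // record it, then wait for the first boundary line
                $mode = 'seek';
            } elseif (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
                $type = strtolower($m[1]);
            } elseif ($line === '') {
                $mode = 'body';      // double line break: the headers are over
            }
        }
        // In 'seek' mode every non-boundary line is ignored.
    }

    // A message without any boundary ends while still in body mode.
    if ($mode === 'body' && $body !== []) {
        $parts[] = ['type' => $type, 'body' => trim(implode("\n", $body))];
    }
    return $parts;
}
```

Called on the message from the question, this should return two entries, one with type `text/plain` and one with type `text/html`. The HTML body is returned as it appears on the wire, so it is still quoted-printable encoded at this point.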

This is something I remember from a long time ago, so it may not be 100% accurate, but I'll mention it just in case. Be careful with messages that have attachments, as you can get two "boundary" markers. One boundary is nested within the other, though, so if you follow the rules above (i.e. grab the first boundary and stick with it) you should be fine. But test your script with some attachments :)
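
If you do want to reach the parts nested inside an inner boundary (for example a multipart/alternative section sitting next to an attachment), one possible approach is to recurse: split an entity on its own boundary only, then treat every chunk as a small message of its own. The sketch below is an assumption-laden illustration; `leafParts()` is a hypothetical helper, and it ignores details such as folded headers and transfer encodings.

```php
<?php
// Illustrative sketch: leafParts() is a hypothetical helper, not a library call.
// It returns the non-multipart ("leaf") parts of a MIME entity, in order.
function leafParts(string $entity): array
{
    // Normalise line endings, then split headers from body at the first
    // blank line (or immediately, if the entity has no headers at all).
    $entity  = str_replace(["\r\n", "\r"], "\n", $entity);
    $pieces  = preg_split('/^\n|\n\n/', $entity, 2);
    $headers = $pieces[0];
    $body    = $pieces[1] ?? '';

    $type = 'text/plain';
    if (preg_match('/^Content-Type:\s*([^;\s]+)/im', $headers, $m)) {
        $type = strtolower($m[1]);
    }

    // Leaf: not multipart, or multipart without a usable boundary parameter.
    if (strpos($type, 'multipart/') !== 0
        || !preg_match('/boundary\s*=\s*"?([^";\s]+)"?/i', $headers, $m)) {
        return [['type' => $type, 'body' => trim($body)]];
    }

    $b = preg_quote($m[1], '/');
    // Drop everything from the closing delimiter onwards (the epilogue).
    $body = preg_split('/^--' . $b . '--[ \t]*$/m', $body, 2)[0];
    // Split on this entity's own delimiter lines only; inner ones stay intact.
    $chunks = preg_split('/^--' . $b . '[ \t]*$/m', $body);
    array_shift($chunks);   // text before the first delimiter (the preamble)

    $leaves = [];
    foreach ($chunks as $chunk) {
        // Each chunk starts with the newline that ended the delimiter line.
        $chunk  = preg_replace('/^\n/', '', $chunk);
        $leaves = array_merge($leaves, leafParts($chunk));
    }
    return $leaves;
}
```

Because each level only looks for its own delimiter, an inner boundary is never mistaken for an outer one; it is handled when the recursion reaches the part that declared it.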


Edit: additional info, as asked in the question. An e-mail can have as many "bodies" as the sender wishes to encode. You can have a plain-text version, an HTML version, a UTF-encoded version, an RTF version or even a Morse code version (if the client knew how to handle "Content-Type: Morse/Code"!). Sometimes you don't get plain text, only an HTML version (naughty users). Sometimes the HTML actually comes without a content type declaration (and may or may not get displayed as HTML, depending on the client). The boundary also separates the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there are somewhere between 0 and X bodies.
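
Since the number of bodies can be anything from zero upwards, code that consumes the parts probably wants an order of preference and a way to say "nothing usable found". The sketch below shows one way to do that, and also undoes the quoted-printable encoding seen on the HTML part in the question. The helper names and the part layout (`type`, `encoding` and `body` keys) are assumptions made for this example; `quoted_printable_decode()` and `base64_decode()` are standard PHP functions.

```php
<?php
// Illustrative sketch: decodeBody(), pickBody() and the array layout of a
// part ('type', 'encoding', 'body') are assumptions made for this example.
function decodeBody(string $body, string $encoding): string
{
    switch (strtolower(trim($encoding))) {
        case 'quoted-printable':
            return quoted_printable_decode($body);
        case 'base64':
            return (string) base64_decode($body);
        default:    // 7bit, 8bit, binary or unknown: leave untouched
            return $body;
    }
}

function pickBody(array $parts, array $preferred = ['text/plain', 'text/html']): ?string
{
    foreach ($preferred as $wanted) {
        foreach ($parts as $part) {
            if ($part['type'] === $wanted) {
                return decodeBody($part['body'], $part['encoding'] ?? '7bit');
            }
        }
    }
    return null;    // zero usable bodies is a legitimate outcome
}
```

The decoded string is still in whatever charset the part declared (UTF-8 in the question's message), so convert it separately if your application expects something else.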
