Parsing raw email in PHP

Question

I'm looking for good/working/simple to use php code for parsing raw email into parts.

I've written a couple of brute force solutions, but every time, one small change/header/space/something comes along and my whole parser fails and the project falls apart.

And before I get pointed at PEAR/PECL, I need actual code. My host has some screwy config or something, I can never seem to get the .so's to build right. If I do get the .so made, some difference in path/environment/php.ini doesn't always make it available (apache vs cron vs cli).

Oh, and one last thing, I'm parsing the raw email text, NOT POP3, and NOT IMAP. It's being piped into the php script via a .qmail email redirect.

I'm not expecting SOF to write it for me, I'm looking for some tips/starting points on doing it "right". This is one of those "wheel" problems that I know has already been solved.

Recommended answer

What are you hoping to end up with? The body, the subject, the sender, an attachment? You should spend some time with RFC 2822 (since superseded by RFC 5322) to understand the format of the mail, but here are the simplest rules for well-formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.
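The HSTRING:HTEXT rule, including continuation lines, can be sketched in plain PHP with no PEAR/PECL dependency. This is only a minimal illustration under some assumptions: `parse_header_block` is a hypothetical name, not something from a library, and folded lines are joined with a single space, which is simpler than strict RFC unfolding. Headers are returned as an ordered list of pairs because names such as Received can legitimately repeat.

```php
<?php
// Hypothetical helper (not from the original answer): parse a raw header
// block into an ordered list of [name, value] pairs.
function parse_header_block(string $rawHeaders): array
{
    $headers = [];
    // Accept CRLF, bare LF, or bare CR line endings.
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        if ($line === '') {
            continue;
        }
        if (($line[0] === ' ' || $line[0] === "\t") && $headers !== []) {
            // Continuation line: it starts with whitespace, so it belongs
            // to the previous header. Joining with one space is a
            // simplification of RFC unfolding.
            $last = count($headers) - 1;
            $headers[$last][1] .= ' ' . trim($line);
            continue;
        }
        // HSTRING has no whitespace or colons; HTEXT is everything after
        // the first colon.
        if (preg_match('/^([^\s:]+)\s*:(.*)$/', $line, $m) === 1) {
            $headers[] = [$m[1], trim($m[2])];
        }
        // Anything else is treated as malformed and skipped.
    }
    return $headers;
}
```

Header names are case-insensitive, so a caller looking for a specific header would probably compare with `strcasecmp()` rather than `===`.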

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).
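Splitting at the first blank line might look something like the following. It is a sketch, not a complete parser: `split_message` is a made-up name, and it assumes the message has at least one header line before the blank line.

```php
<?php
// Hypothetical helper: split a raw message at the first blank line.
// Returns [headerBlock, body]; body is '' when no blank line is found.
function split_message(string $raw): array
{
    // The limit of 2 ensures only the FIRST blank line is used as the
    // separator, so blank lines inside the body are left alone.
    // Both CRLF and bare LF line endings are tolerated.
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);
    return [$parts[0], $parts[1] ?? ''];
}
```

For a script fed by a pipe, as in the question, the raw text would typically come from `file_get_contents('php://stdin')` before being passed to a function like this.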

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
  INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)



Most modern email is more complex than that, though. Headers can be encoded for charsets as RFC 2047 MIME encoded words, or in a ton of other ways I'm not thinking of right now. The bodies are really hard to roll your own code for these days, too, if you want them to be meaningful. Almost all email generated by an MUA will be MIME encoded. The body might be quoted-printable text, it might be HTML, it might be a base64-encoded Excel spreadsheet.
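For the two most common encoding layers, PHP's bundled functions may be enough to get started. The sketch below assumes the iconv extension is enabled (it usually is by default, but that is worth checking on an unusual host), and the function names `decode_header_value` and `decode_body` are hypothetical. It handles a single body part only; walking a multipart MIME structure is a separate and larger job.

```php
<?php
// Hypothetical helper: decode RFC 2047 encoded words in a header value
// to UTF-8. Assumes the bundled iconv extension is available.
function decode_header_value(string $value): string
{
    $decoded = iconv_mime_decode($value, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');
    // Fall back to the raw value if decoding fails outright.
    return $decoded === false ? $value : $decoded;
}

// Hypothetical helper: undo the Content-Transfer-Encoding of one body part.
// $transferEncoding is the value of that header, e.g. "base64".
function decode_body(string $body, string $transferEncoding): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            // Non-strict mode ignores the line breaks inside the data.
            $decoded = base64_decode($body, false);
            return $decoded === false ? $body : $decoded;
        case 'quoted-printable':
            return quoted_printable_decode($body);
        default:
            // 7bit, 8bit, binary, or unknown: return unchanged.
            return $body;
    }
}
```

Note that `decode_body` returns raw bytes; converting text parts to a common charset based on the Content-Type header would be a further step.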

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.
