How do I get just the text content from a multipart email?
Question
#!/usr/bin/php -q
<?php
$savefile = "savehere.txt";
$sf = fopen($savefile, 'a') or die("can't open file");
ob_start();
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd)) {
    $email .= fread($fd, 1024);
}
fclose($fd);
// handle email
$lines = explode("\n", $email);
// empty vars
$from = "";
$subject = "";
$headers = "";
$message = "";
$splittingheaders = true;
for ($i=0; $i < count($lines); $i++) {
    if ($splittingheaders) {
        // this is a header
        $headers .= $lines[$i]."\n";
        // look out for special headers
        if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
            $subject = $matches[1];
        }
        if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
            $from = $matches[1];
        }
        if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
            $to = $matches[1];
        }
    } else {
        // not a header, but message
        $message .= $lines[$i]."\n";
    }
    if (trim($lines[$i])=="") {
        // empty line, header section has ended
        $splittingheaders = false;
    }
}
/*$headers is ONLY included in the result at the last section of my question here*/
fwrite($sf,"$message");
ob_end_clean();
fclose($sf);
?>
That is an example of my attempt. The problem is that too much ends up in the file. Here is what is being written to it (as you can see, I just sent a bunch of garbage as the test message):
From xxxxxxxxxxxxx Tue Sep 07 16:26:51 2010
Received: from xxxxxxxxxxxxxxx ([xxxxxxxxxxx]:3184 helo=xxxxxxxxxxx)
by xxxxxxxxxxxxx with esmtpa (Exim 4.69)
(envelope-from <xxxxxxxxxxxxxxxx>)
id 1Ot4kj-000115-SP
for xxxxxxxxxxxxxxxxxxx; Tue, 07 Sep 2010 16:26:50 -0400
Message-ID: <EE3B7E26298140BE8700D9AE77CB339D@xxxxxxxxxxx>
From: "xxxxxxxxxxxxx" <xxxxxxxxxxxxxx>
To: <xxxxxxxxxxxxxxxxxxxxx>
Subject: stackoverflow is helping me
Date: Tue, 7 Sep 2010 16:26:46 -0400
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_0169_01CB4EA9.773DF5E0"
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 14.0.8089.726
X-MIMEOLE: Produced By Microsoft MimeOLE V14.0.8089.726
This is a multi-part message in MIME format.
------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
111
222
333
444
------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3Dtext/html;charset=3Diso-8859-1 =
http-equiv=3DContent-Type>
<META name=3DGENERATOR content=3D"MSHTML 8.00.6001.18939"></HEAD>
<BODY style=3D"PADDING-LEFT: 10px; PADDING-RIGHT: 10px; PADDING-TOP: =
15px"=20
id=3DMailContainerBody leftMargin=3D0 topMargin=3D0 =
CanvasTabStop=3D"true"=20
name=3D"Compose message area">
<DIV><FONT face=3DCalibri>111</FONT></DIV>
<DIV><FONT face=3DCalibri>222</FONT></DIV>
<DIV><FONT face=3DCalibri>333</FONT></DIV>
<DIV><FONT face=3DCalibri>444</FONT></DIV></BODY></HTML>
------=_NextPart_000_0169_01CB4EA9.773DF5E0--
I found the following while searching around, but I have no idea how to implement it, where to insert it in my code, or whether it would even work.
preg_match("/boundary=\".*?\"/i", $headers, $boundary);
$boundaryfulltext = $boundary[0];
if ($boundaryfulltext!="")
{
    $find = array("/boundary=\"/i", "/\"/i");
    $boundarytext = preg_replace($find, "", $boundaryfulltext);
    $splitmessage = explode("--" . $boundarytext, $message);
    $fullmessage = ltrim($splitmessage[1]);
    preg_match('/\n\n(.*)/is', $fullmessage, $splitmore);
    if (substr(ltrim($splitmore[0]), 0, 2)=="--")
    {
        $actualmessage = $splitmore[0];
    }
    else
    {
        $actualmessage = ltrim($splitmore[0]);
    }
}
else
{
    $actualmessage = ltrim($message);
}
$clean = array("/\n--.*/is", "/=3D\n.*/s");
$cleanmessage = trim(preg_replace($clean, "", $actualmessage));
So, how can I get just the plain text area of the email into my file or script for further handling?
Thanks in advance. stackoverflow is great!
Recommended answer
There are four steps that you will have to take in order to isolate the plain text part of your email body:
1. Get the MIME boundary string
We can use a regular expression to search your headers (let's assume they're in a separate variable, $headers):
$matches = array();
preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);
list(, $boundary) = $matches;
The regular expression searches for the Content-Type header that contains the boundary string and captures the boundary into the first capture group. We then copy that capture group into the variable $boundary.
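As a quick check of the pattern, here is a minimal sketch that runs it against the headers from the sample message in the question. The tab before boundary= is an assumption (the pasted dump lost its indentation), and the guard on the preg_match return value is an addition so that a non-multipart message yields an empty string rather than a warning.

```php
<?php
// Headers copied from the sample message in the question.
// Assumption: the folded "boundary=" line starts with a tab.
$headers = "MIME-Version: 1.0\r\n"
         . "Content-Type: multipart/alternative;\r\n"
         . "\tboundary=\"----=_NextPart_000_0169_01CB4EA9.773DF5E0\"\r\n"
         . "X-Priority: 3\r\n";

$matches = array();
$found = preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);

// Only read the capture group when the pattern actually matched.
$boundary = ($found === 1) ? $matches[1] : "";

echo $boundary, "\n";
```

The \s* in the pattern is what lets the match continue across the line break of the folded header.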
2. Split the email body into segments
Once we have the boundary, we can split the body into its various parts (in your message body, the boundary is prefixed by -- each time it appears). According to the MIME spec, everything before the first boundary should be ignored.
$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop everything before the first boundary
This leaves us with an array containing all the segments, with everything before the first boundary dropped. The final element is whatever follows the closing boundary (its trailing -- plus any epilogue), which is harmless for our purposes.
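To see what the split produces, here is a small sketch on a made-up message. The boundary XYZ and the message text are invented for illustration, and removing the leftover from the closing delimiter is an optional extra step, not something the approach above requires.

```php
<?php
// Sample data for illustration only; "XYZ" is a made-up boundary.
$boundary = "XYZ";
$message = "This is a multi-part message in MIME format.\r\n"
         . "--XYZ\r\nContent-Type: text/plain\r\n\r\nhello\r\n"
         . "--XYZ\r\nContent-Type: text/html\r\n\r\n<p>hello</p>\r\n"
         . "--XYZ--\r\n";

$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop the preamble before the first boundary

// The closing delimiter is "--XYZ--", so the last piece starts with "--".
$last = end($email_segments);
if ($last !== false && substr($last, 0, 2) === '--') {
    array_pop($email_segments);
}

echo count($email_segments), " segments\n"; // prints "2 segments"
```

Each remaining element still begins with the line break that followed the boundary, then the part's own headers, a blank line, and the part's content.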
3. Determine which segment is plain text
The plain text segment will have a Content-Type header with the MIME type text/plain. We can now search the segments for the first one with that header:
foreach ($email_segments as $segment)
{
    if (stristr($segment, "Content-Type: text/plain") !== false)
    {
        // We found the segment we're looking for!
    }
}
Since what we're looking for is a constant, we can use stristr (which finds the first instance of a substring in a string, case-insensitively) instead of a regular expression. If the Content-Type header is found, we've got our segment.
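The search can be wrapped in a small function that returns the match, sketched below. The name find_plain_text_segment and the two sample segments are hypothetical, chosen only for this illustration. Note that stristr looks at the whole segment, so a body that happens to quote the string Content-Type: text/plain would also match.

```php
<?php
// Hypothetical helper: return the first segment declared as text/plain, or null.
function find_plain_text_segment(array $segments)
{
    foreach ($segments as $segment) {
        if (stristr($segment, "Content-Type: text/plain") !== false) {
            return $segment;
        }
    }
    return null;
}

// Made-up segments shaped like the ones in the question's sample message.
$email_segments = array(
    "\r\nContent-Type: text/html;\r\n\tcharset=\"iso-8859-1\"\r\n\r\n<DIV>111</DIV>\r\n",
    "\r\nContent-Type: text/plain;\r\n\tcharset=\"iso-8859-1\"\r\n\r\n111\r\n",
);

$segment = find_plain_text_segment($email_segments);
var_dump($segment !== null);
```

Returning null when nothing matches lets the caller fall back to some other handling, for example treating the whole message as plain text.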
4. Remove any headers from the segment
Now we need to remove any headers from the segment we found, as we only want the actual message content. The MIME headers you are likely to see here are Content-Type (as we saw before), Content-ID, Content-Disposition and Content-Transfer-Encoding. Headers are terminated by \r\n, so we can use that to determine the end of each header:
$text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment);
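The s modifier at the end of the regular expression makes the dot match newlines as well, and the i modifier makes the match case-insensitive. .*? collects as few characters as possible (i.e. everything up to the first \r\n); the ? makes .* lazy. Note that this assumes each header fits on a single line ending in \r\n: a folded header, such as the Content-Type in your sample whose charset sits on a continuation line, leaves that continuation line behind, and if the message uses bare \n line endings the pattern has no \r\n to stop at.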
After this, $text will contain your email message content.
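If the segment has folded headers, bare \n line endings, or a transfer encoding such as quoted-printable (all three are visible in the sample message), one alternative is to cut the segment at the first blank line instead of removing headers by name. The sketch below is a different technique from the regular expression above, not a drop-in equivalent; extract_segment_body is a hypothetical name, and the code assumes the segment has at least one header, which holds for a segment found via its Content-Type.

```php
<?php
// Hypothetical helper: split one MIME segment at the first blank line
// and return its body with the transfer encoding undone.
function extract_segment_body($segment)
{
    // Headers end at the first blank line; accept both CRLF and bare LF.
    $parts = preg_split('/\r?\n\r?\n/', ltrim($segment, "\r\n"), 2);
    if (count($parts) < 2) {
        return ""; // no blank line, so no body
    }
    list($headers, $body) = $parts;

    // Undo the transfer encoding declared in the segment headers, if any.
    if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $headers, $m)) {
        $encoding = strtolower($m[1]);
        if ($encoding === 'quoted-printable') {
            $body = quoted_printable_decode($body);
        } elseif ($encoding === 'base64') {
            $body = base64_decode($body);
        }
    }
    return trim($body);
}

// Made-up segment with a folded header, LF line endings and an encoded "=".
$segment = "\nContent-Type: text/plain;\n\tcharset=\"iso-8859-1\"\n"
         . "Content-Transfer-Encoding: quoted-printable\n\n"
         . "111\n222\nx=3D1\n";

echo extract_segment_body($segment), "\n";
```

The returned text is still in the charset named by the Content-Type header (iso-8859-1 in the sample), so converting it is a separate step if you need UTF-8.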
So, to put it all together with your code:
<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
    $email .= fread($fd, 1024);
}
fclose($fd);

// look for the MIME boundary; $matches[1] only exists if the pattern matched
$matches = array();
preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $email, $matches);
$boundary = isset($matches[1]) ? $matches[1] : "";

$text = "";
if ($boundary !== "") // did we find a boundary?
{
    $email_segments = explode('--' . $boundary, $email);
    foreach ($email_segments as $segment)
    {
        if (stristr($segment, "Content-Type: text/plain") !== false)
        {
            $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment));
            break;
        }
    }
}

// At this point, $text will either contain your plain text body,
// or be an empty string if a plain text body couldn't be found.
$savefile = "savehere.txt";
$sf = fopen($savefile, 'a') or die("can't open file");
fwrite($sf, $text);
fclose($sf);
?>