Parse email content from quoted reply


Question

I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.

Recommended answer

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To, and Thread-Index headers to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have reassembled the thread, you can remove the external text (such as the To, From, and CC lines) and you're done.
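As a rough illustration of the header-based approach, here is a minimal sketch in Python using the standard library's `email` module (the same idea carries over to C#). The function names `build_parent_map` and `thread_root` are made up for this example, and it only looks at Message-ID and In-Reply-To; Outlook's Thread-Index is ignored.

```python
from email import message_from_string
from typing import Optional


def build_parent_map(raw_messages: list[str]) -> dict[str, Optional[str]]:
    """Map each Message-ID to the Message-ID it replies to (None if absent)."""
    parents: dict[str, Optional[str]] = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # a message without an ID cannot be placed in a thread
        in_reply_to = (msg.get("In-Reply-To") or "").strip()
        parents[msg_id] = in_reply_to or None
    return parents


def thread_root(msg_id: str, parents: dict[str, Optional[str]]) -> str:
    """Follow In-Reply-To links upward until no further parent is known."""
    seen = set()
    while parents.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)  # guard against malformed, cyclic reply chains
        msg_id = parents[msg_id]
    return msg_id
```

Messages that share a root belong to the same thread. Once each message's parent is known, its quoted section can be compared against the parent's body rather than guessed at.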

Two, if the messages you are working with do not have the headers, you are stuck with similarity matching to determine which parts of an email repeat text from an earlier message. In that case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.
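To make the similarity idea concrete, here is a minimal sketch in Python. It is not the Code Project implementation: `levenshtein` is the textbook dynamic-programming version, and `strip_repeated_lines`, its line-by-line comparison, and the 20% threshold are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_repeated_lines(reply: str, earlier: str, max_ratio: float = 0.2) -> str:
    """Drop lines of `reply` that nearly match some line of `earlier`."""
    def normalize(line: str) -> str:
        # Ignore quote markers, indentation, and case when comparing.
        return line.lstrip("> \t").strip().lower()

    earlier_lines = [n for n in map(normalize, earlier.splitlines()) if n]
    kept = []
    for line in reply.splitlines():
        norm = normalize(line)
        is_quoted = bool(norm) and any(
            levenshtein(norm, old) <= max_ratio * max(len(norm), len(old))
            for old in earlier_lines
        )
        if not is_quoted:
            kept.append(line)
    return "\n".join(kept).strip()
```

Comparing every line against every earlier line is quadratic, so this is only practical for short messages. Fuzzy matching is used instead of exact matching because mail clients often re-wrap or slightly alter quoted text.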

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads: http://academiccommons.columbia.edu/download/fedora_content/download/ac:162861/CONTENT/yeh_harnly_06.pdf

When you don't have the thread:

If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:


  1. A line (as in Outlook).

  2. Angle brackets

  3. ---Original Message---

  4. On such-and-such day, so-and-so wrote:

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
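The four markers listed above can be sketched as a handful of regular expressions. This is a minimal Python sketch under stated assumptions: the exact patterns (for instance, treating a run of five or more underscores as Outlook's separator line) and the name `strip_quoted_reply` are guesses for illustration, and real clients vary in wording and language.

```python
import re

# Each pattern marks a line where quoted text is assumed to begin.
QUOTE_START_PATTERNS = [
    re.compile(r"^_{5,}\s*$"),                                   # separator line (assumed Outlook style)
    re.compile(r"^>"),                                           # angle-bracket prefix
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.I),  # ---Original Message---
    re.compile(r"^On\b.*\bwrote:\s*$", re.I),                    # On <date>, <person> wrote:
]


def strip_quoted_reply(body: str) -> str:
    """Return the text above the first line that looks like a quote marker."""
    lines = body.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        if any(p.match(stripped) for p in QUOTE_START_PATTERNS):
            return "\n".join(lines[:index]).rstrip()
    return body.rstrip()  # no marker found: keep the whole message
```

As the answer warns, this cuts everything from the first marker down, so an interleaved reply loses every comment written below the first quoted line. It also misses an "On ... wrote:" header that the client has wrapped onto two lines.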
