Email parsing and processing architecture

Question

I'm doing some heavy processing on each incoming email. Let's say I'm building an AI for a system that will auto-reply to the emails it receives, but I still don't know where to start.

Here's what I'm thinking of:

Architecture 1

Problems:

  1. Let's say we have 1000 emails/sec. How exactly does a mail server (Exim, Sendmail, Dovecot, etc.) work?

  2. Can parseandsavetomysql.py process 1000 emails per second through piping? How does that work, too? By the way, it's currently working fine, but I need to understand this.

  3. Is my logic about a worker, or a queuing system, correct? I have looked at Resque and friends, but I still don't get it. How can we lock a session? Say, in this problem: "Hey, I'm processing this file; don't work on email1.rawemail, work on another one." How can we do that the correct or simpler way?

Architecture 2

Problems:

  1. As written above.
  2. How can a POP/SMTP server receive 1000 emails/sec?
  3. Can we get email via IMAP and POP? Because we are only processing, is POP3 the right choice for performance? There is an imap_open function in PHP that I'm currently using.

Add-on

  1. Is there a good link or blog post that solves the same problem I have?
  2. Please give me links to projects, apps, or third parties that solve my problem.
  3. If you have anything else in mind, please do write it down.

Thanks for helping out, Adam Ramadhan

Edit: added my current architecture.

Solution

Like a lot of "big picture" architecture questions, the best solution is really one of those "it depends" answers. Can you control the deployment environment? That is, can you use whatever e-mail server you'd like, or are you constrained to using one that's already installed and hosted? Can you run code on the same machine as the SMTP service? These questions, and a lot of others, should be considered to come up with a (near-)optimal architecture.

Given that, I'm going to make a couple of assumptions and offer some ideas that I think are worth exploring...

You should look into a high-performance messaging system. Specifically, take a look at RabbitMQ. RabbitMQ is reliable and efficient, and the distribution of workload based on asynchronous incoming events is a pattern that they specifically discuss in their (in my opinion, very good) tutorials.

With a messaging server like this, you have one process that receives the incoming e-mail. Preferably this is done as part of the SMTP process, or at least very close to it, especially with the workload that you've mentioned. If you have no other choice, then your idea of using cron to gather messages via POP or IMAP will have to do for now.
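As a rough illustration of the gathering step, here is a minimal sketch that assumes the mail server hands each raw message to a script, as in the piping setup from the question. The names `spool_message` and `read_headers` and the `.eml` naming scheme are hypothetical, not part of any mail server's API. It stores the message under a unique name and returns the path, which is the "reference" that would later be queued.

```python
import email
import email.policy
import os
import tempfile
import uuid
from pathlib import Path


def spool_message(raw: bytes, spool_dir: Path) -> Path:
    """Store one raw message and return the path that would be queued."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    final_path = spool_dir / f"{uuid.uuid4().hex}.eml"
    # Write to a temporary file first, then rename it. A rename within one
    # filesystem is atomic, so a reader never sees a half-written message.
    fd, tmp_name = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(raw)
    os.replace(tmp_name, final_path)
    return final_path


def read_headers(path: Path) -> dict:
    """Parse a stored message and return the headers a worker might need."""
    with path.open("rb") as fh:
        msg = email.message_from_binary_file(fh, policy=email.policy.default)
    return {"from": msg["From"], "subject": msg["Subject"]}
```

When the script is invoked through a pipe, the raw message would typically be read with `sys.stdin.buffer.read()` and passed to `spool_message`. Keeping this step small (store, then hand off a reference) leaves the heavy processing to the workers.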

The e-mail gathering process would then push messages into the RabbitMQ queue. (Perhaps not literally the e-mails themselves, although that is a possibility; I was thinking more of references to where each e-mail is efficiently stored.) You then run multiple worker processes that are subscribed to a named message queue. RabbitMQ (or whatever messaging service you decide upon) would then distribute those messages in a round-robin fashion to the individual subscribers. If already loaded, worker processes can NACK the message, or send their own flow-control message back to the service.

With a VERY high workload (again, like you've proposed), I'd highly recommend some kind of management process that keeps tabs on the overall health of the distributed system. The manager would gather runtime statistics (VERY useful for future growth planning, optimization, and refactoring of the overall system), and have the ability to spin up and shut down worker processes.

Before you get to that very high workload, and assuming that your worker processes are stable and can live a long time without memory fragmentation, etc., just using the message server to distribute work should suffice.
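The worker pattern described above can be sketched without a broker. The example below is only a stand-in: it uses Python's standard `queue.Queue` in place of RabbitMQ and threads in place of worker processes, and `run_workers` and `handle` are hypothetical names. Unlike the round-robin dispatch described above, `queue.Queue` simply hands each item to whichever worker asks next, but the key property is the same: each reference goes to exactly one worker.

```python
import queue
import threading


def run_workers(references, handle, worker_count=4):
    """Hand each reference to exactly one worker and collect the outcomes."""
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker(name):
        while True:
            ref = jobs.get()
            if ref is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                outcome = handle(ref)
            except Exception as exc:
                # A real worker would NACK or requeue the message here.
                outcome = exc
            with results_lock:
                results.append((name, ref, outcome))
            jobs.task_done()  # plays the role of an acknowledgement

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for ref in references:
        jobs.put(ref)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results
```

Because the queue delivers each reference to a single consumer, no per-file lock of the "don't work on email1.rawemail" kind is needed in this sketch; the queue itself decides who works on what.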

For what it's worth, I've had some experience writing e-mail processors (specifically with xmail, which I'd recommend if you're just starting out your project and have a lot of control over its early stages). Also, I'm currently using RabbitMQ to build a multi-agent result caching system for a major scientific computing grid.

Anyway...good luck with your project!
