Design Question for Notification System

Views: 186 Posted: 2015/11/30 22:37:23 Tags: algorithm design search search-engine web-crawler


Question

To clarify the problem further: the purpose of the notification system is to notify users (via email for now) when the content of a site has changed or been updated, or when a new posting is made. It can be treated as a system in which people define a rule or keyword for a third-party site; the notification system then crawls that site and creates inverted search indexes. When a new link or document shows up that matches a user-defined keyword or rule, the user is notified (the use case is explained in more detail below).

To clarify the use case: suppose I am a Craigslist user looking for a used vehicle. I define a rule with the keyword "Honda Accord", the year 1996, and a price range of $2000 to $3000.

For the above use case to work, what is the best approach, and how can I leverage open-source technology such as Apache Lucene, Apache Solr, Apache Nutch, and Apache Hadoop to solve it? You can think of it as building a search engine with a rule and keyword notification system on top. I just need some pointers and help on how to integrate these open-source packages to solve the use case.

Any help and pointers will be appreciated. The three important components we need are:


1) Web Crawler
2) Index Creator
3) Rule or Keyword Matcher

Any help will be greatly appreciated. I am referring to this wiki on using Nutch and Solr together for the above purpose: http://wiki.apache.org/nutch/RunningNutchAndSolr

Recommended Answer

Your question is a big one, but I'll take a stab at it, as I've designed and implemented systems like this before.

Ignoring user account management, your system will need to provide the means to:

retrieve new prospect data (web spider)

identify and extract pertinent results from prospect data (filtering)

collect, maintain and organize results (storage)

select results based on various metadata (querying)

format results for delivery to users (templating)

deliver formatted results to users (delivery)

If the scope of your project is small (say, fewer than 100 sites requiring spidering per day), you could probably get along with one of the many open-source web spiders, including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
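
As a rough illustration of the "custom software for scheduling, monitoring and control" part, here is a minimal Python sketch of one crawl pass that reports which pages changed since the previous pass. It is not a replacement for wget or Nutch: the function names `fetch_url` and `crawl_once`, the SHA-256 fingerprinting, and the in-memory `seen_hashes` dictionary are all assumptions made for the example, and a real spider would also need politeness delays, robots.txt handling and persistent state.

```python
import hashlib
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Download one page body (hypothetical helper; no politeness or robots.txt logic)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_once(
    urls: Iterable[str],
    seen_hashes: Dict[str, str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> List[str]:
    """Fetch every URL once and return those whose content is new or changed.

    `seen_hashes` maps URL -> SHA-256 of the last body seen and is updated in place.
    """
    changed = []
    for url in urls:
        try:
            body = fetch(url)
        except OSError:
            # Network errors (urllib's URLError is an OSError) are skipped here;
            # a real system would log them and retry later.
            continue
        digest = hashlib.sha256(body).hexdigest()
        if seen_hashes.get(url) != digest:
            seen_hashes[url] = digest
            changed.append(url)
    return changed
```

Calling `crawl_once` from a scheduler (cron, for instance) and handing the returned URLs to the filtering step would be one way to wire this in. Note that hashing the whole body flags any byte-level change, including rotating ads or timestamps, so you would probably hash only the extracted content in practice.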

For simple filtering, regular expressions will suffice, but for more complex tasks that require knowledge of the HTML layout (for example, extracting the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
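
To make the two levels of filtering concrete, here is a small Python sketch that uses the standard library's `html.parser` (a lenient HTML parser standing in for the XHTML parser mentioned above) to pull out list-item text, and then regular expressions to apply the Honda Accord rule from the question. The `Rule` class, the regular expressions and the assumption that each listing sits in its own `<li>` are all made up for illustration; real listing pages will need their own extraction logic.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List

PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")        # e.g. "$2,500"
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")   # any 4-digit number that looks like a year


class ListItemText(HTMLParser):
    """Collect the whitespace-normalized text of every top-level <li> element."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []
        self._depth = 0
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


@dataclass
class Rule:
    """Hypothetical user rule: keyword plus year and price range."""
    keyword: str
    year: int
    min_price: int
    max_price: int

    def matches(self, text: str) -> bool:
        if self.keyword.lower() not in text.lower():
            return False
        years = {int(y) for y in YEAR_RE.findall(text)}
        price = PRICE_RE.search(text)  # only the first price in the text is considered
        if self.year not in years or price is None:
            return False
        amount = int(price.group(1).replace(",", ""))
        return self.min_price <= amount <= self.max_price


def filter_listings(html: str, rule: Rule) -> List[str]:
    """Return the list-item texts in `html` that satisfy `rule`."""
    parser = ListItemText()
    parser.feed(html)
    return [text for text in parser.items if rule.matches(text)]
```

With `Rule("Honda Accord", 1996, 2000, 3000)`, a listing such as "1996 Honda Accord - $2,500" passes, while the same car at $3,900 or a 2004 Civic does not.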

While any database technology can be used to store the results extracted from retrieved documents, using an engine optimized for text, like Apache Solr, will allow you to easily expand your search criteria as your needs dictate. Since Solr supports attaching metadata to each document and searching on it, it would be a good choice. You'll also need to provide custom software here to automate this step.
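
As a sketch of what the automated querying step might look like, the Python snippet below builds a request URL for Solr's standard `select` handler, using `q` for the keyword part of a rule and `fq` filter queries for the metadata. The core name `listings`, the field names `title`, `year` and `price`, and the local base URL are assumptions for the example; your schema will differ. Nothing is sent over the network here; the function only constructs the URL.

```python
from typing import Iterable
from urllib.parse import urlencode


def build_solr_select_url(
    base_url: str,
    core: str,
    query: str,
    filters: Iterable[str] = (),
    rows: int = 20,
) -> str:
    """Build a URL for Solr's select handler with one main query and optional filter queries."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]           # each fq narrows the result set
    params += [("rows", str(rows)), ("wt", "json")]  # ask for JSON output
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# The Craigslist rule from the question, against a hypothetical "listings" core.
url = build_solr_select_url(
    "http://localhost:8983/solr",
    "listings",
    'title:"Honda Accord"',
    filters=["year:1996", "price:[2000 TO 3000]"],
)
print(url)
```

Keeping the year and price constraints in filter queries rather than in the main query keeps them out of relevance scoring, which seems a reasonable fit for rule-style matching.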

Once you've selected a list of candidate results from Solr, any scripting language could be used to template them into one or more emails and to inject them into your mail transfer agent (MTA). This also requires custom software to automate the process (and, if required, to inject user-specific data into each message).
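
For the templating and delivery steps, a minimal Python sketch using only the standard library could look like the following. The template text, the sender address `alerts@example.com`, the result dictionary keys `title` and `url`, and the assumption of an MTA listening on `localhost:25` are all placeholders for illustration.

```python
import smtplib
from email.message import EmailMessage
from string import Template
from typing import Dict, List

BODY_TEMPLATE = Template(
    'Hello $user,\n\nNew results for your rule "$rule":\n\n$results\n'
)


def build_notification(
    user_name: str,
    user_email: str,
    rule_label: str,
    results: List[Dict[str, str]],
    sender: str = "alerts@example.com",  # placeholder sender address
) -> EmailMessage:
    """Template candidate results into one email with user-specific data filled in."""
    lines = "\n".join(f"- {r['title']} ({r['url']})" for r in results)
    message = EmailMessage()
    message["From"] = sender
    message["To"] = user_email
    message["Subject"] = f"{len(results)} new result(s) for {rule_label}"
    message.set_content(
        BODY_TEMPLATE.substitute(user=user_name, rule=rule_label, results=lines)
    )
    return message


def send_notification(message: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to an MTA over SMTP (assumes one is listening on host:port)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)
```

Separating `build_notification` from `send_notification` lets you test the templating without a mail server, and swap the delivery channel later if email stops being the only one.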
