Is Erlang the right choice for a webcrawler?

Views: 116 | Published: 2017/8/27 12:19:37 | Tags: erlang, web-crawler


Question


I am planning to write a webcrawler for an NLP project. At a fixed interval it reads in the thread structure of a forum and parses each thread that has new content. Regular expressions extract the author, the date, and the content of each new post. The result is then stored in a database.

The language and platform used for the crawler have to meet the following criteria:

easily scalable across multiple cores and CPUs
suited for high I/O loads
fast regular expression matching
easy to maintain / low operational overhead

After some research I think Erlang might be a fitting candidate, but I have read that it is not very good at string processing (and therefore at regular expression matching). I also have no experience with how it fares on maintenance.

Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?

Answer

I am also evaluating Erlang for use as a web crawler, and it looks good so far.

There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.

Other people are interested in the same use case, so you can learn from them.

However, if this is just a one-off project, I recommend Python / Ruby / Perl because it will be easier to get started with.
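To give a feel for the one-off route, here is a minimal Python sketch of the extract-and-store step described in the question. It is only an illustration under stated assumptions: the HTML markup, the `POST_RE` pattern, the `posts` table, and the function names `extract_posts` and `store_posts` are all hypothetical, and a real forum would need its own pattern (or a proper HTML parser). It uses only the standard-library `re` and `sqlite3` modules.

```python
import re
import sqlite3

# Hypothetical forum markup; a real forum's HTML will differ.
POST_RE = re.compile(
    r'<div class="post">\s*'
    r'<span class="author">(?P<author>[^<]+)</span>\s*'
    r'<span class="date">(?P<date>[^<]+)</span>\s*'
    r'<p class="body">(?P<content>.*?)</p>\s*'
    r'</div>',
    re.DOTALL,  # let the post body span several lines
)


def extract_posts(html):
    """Return a list of (author, date, content) tuples found in html."""
    return [
        (
            m.group("author").strip(),
            m.group("date").strip(),
            m.group("content").strip(),
        )
        for m in POST_RE.finditer(html)
    ]


def store_posts(conn, posts):
    """Insert posts, skipping rows already stored by an earlier crawl."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "author TEXT, posted TEXT, content TEXT, "
        "UNIQUE(author, posted, content))"
    )
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", posts)
    conn.commit()


# Made-up sample page standing in for a fetched forum thread.
SAMPLE_HTML = """
<div class="post">
  <span class="author">alice</span>
  <span class="date">2017-08-27</span>
  <p class="body">First post.</p>
</div>
<div class="post">
  <span class="author">bob</span>
  <span class="date">2017-08-28</span>
  <p class="body">A reply
spanning two lines.</p>
</div>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE_HTML))
    # A second crawl of the same page adds no new rows.
    store_posts(conn, extract_posts(SAMPLE_HTML))
    print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` is one simple way to make repeated crawls at a fixed interval idempotent; fetching pages and scheduling the interval are left out of the sketch.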
