Is Erlang the right choice for a webcrawler?


Question

I am planning to write a webcrawler for an NLP project that reads in the thread structure of a forum at a fixed interval and parses each thread that has new content. The author, the date and the content of new posts are extracted via regular expressions. The result is then stored in a database.

The language and platform used for the crawler have to match the following criteria:

  • easily scalable across multiple cores and CPUs
  • suited for high I/O loads
  • fast regular expression matching
  • easy to maintain, with little operational overhead

After some research I think Erlang might be a fitting candidate, but I read it's not very good at string processing (and therefore at regular expression matching). I also have no experience with how easy it is to maintain.

Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?

Answer

I am also evaluating Erlang for use as a web crawler, and it looks good so far.

There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.

And other people are interested in the same use case, so you can learn from them.

However, if this is just a one-off project, I recommend Python / Ruby / Perl because it will be easier to get started with.
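
To give a feel for the "easier to get started" route, here is a minimal Python sketch of the extract-and-store step from the question, using only the standard library. The forum markup, the POST_RE pattern, the function names and the table layout are all made up for illustration; a real forum will need its own pattern, and fetching the pages over HTTP is left out.

```python
import re
import sqlite3

# Hypothetical forum markup; a real forum's HTML will differ.
POST_RE = re.compile(
    r'<div class="post">\s*'
    r'<span class="author">(?P<author>.*?)</span>\s*'
    r'<span class="date">(?P<date>.*?)</span>\s*'
    r'<p class="content">(?P<content>.*?)</p>\s*'
    r'</div>',
    re.DOTALL,  # let post content span several lines
)


def extract_posts(html):
    """Return (author, date, content) tuples for every post found in html."""
    return [
        (m.group("author").strip(),
         m.group("date").strip(),
         m.group("content").strip())
        for m in POST_RE.finditer(html)
    ]


def store_posts(conn, posts):
    """Insert posts, skipping rows that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "author TEXT, posted TEXT, content TEXT, "
        "UNIQUE(author, posted, content))"
    )
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", posts)
    conn.commit()


# Made-up sample page in the hypothetical markup above.
SAMPLE = """
<div class="post">
  <span class="author">alice</span>
  <span class="date">2010-03-01</span>
  <p class="content">First post.</p>
</div>
<div class="post">
  <span class="author">bob</span>
  <span class="date">2010-03-02</span>
  <p class="content">A reply
spanning two lines.</p>
</div>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE))
    for row in conn.execute("SELECT author, posted FROM posts ORDER BY posted"):
        print(row)
```

The UNIQUE constraint together with INSERT OR IGNORE means a thread can be re-read at every interval without storing the same post twice, which is one simple way to handle "only new content".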
