Is Erlang the right choice for a webcrawler?
I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a fixed interval and parses each thread that has new content. Regular expressions extract the author, date, and content of each new post. The results are then stored in a database.
The language and platform used for the crawler have to meet the following criteria:
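As a rough illustration of the extract-and-store step described above, here is a minimal sketch in Python. The HTML markup, the `POST_RE` pattern, the `posts` table, and the function names are all hypothetical; a real forum will need its own pattern, and SQLite stands in for whatever database is actually used.

```python
import re
import sqlite3

# Hypothetical forum markup; a real forum's HTML will differ.
POST_RE = re.compile(
    r'<div class="post">\s*'
    r'<span class="author">(?P<author>.*?)</span>\s*'
    r'<span class="date">(?P<date>.*?)</span>\s*'
    r'<p class="content">(?P<content>.*?)</p>\s*'
    r'</div>',
    re.DOTALL,
)

def extract_posts(html):
    """Return a list of (author, date, content) tuples found in html."""
    return [(m["author"], m["date"], m["content"].strip())
            for m in POST_RE.finditer(html)]

def store_posts(conn, posts):
    """Insert posts, skipping any that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "author TEXT, date TEXT, content TEXT, "
        "UNIQUE(author, date, content))"
    )
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", posts)
    conn.commit()

# Made-up sample page used only to exercise the sketch.
SAMPLE = """
<div class="post">
  <span class="author">alice</span>
  <span class="date">2010-03-01</span>
  <p class="content">First post.</p>
</div>
<div class="post">
  <span class="author">bob</span>
  <span class="date">2010-03-02</span>
  <p class="content">A reply.</p>
</div>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE))
    for row in conn.execute("SELECT author, date, content FROM posts"):
        print(row)
```

The `UNIQUE` constraint together with `INSERT OR IGNORE` means a thread can be re-crawled at every interval without storing the same post twice.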
- easily scalable on multiple cores and cpus
- suited for high I/O loads
- fast regular expression matching
- easy to maintain / low operational overhead
After some research I think Erlang might be a fitting candidate, but I have read that it is not very good at string processing (and therefore at regular expression matching). I also have no experience with how it fares on maintenance.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?
I am also evaluating Erlang for use as a web crawler, and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
And other people are interested in the same use case, so you can learn from them.
However, if this is just a one-off project, I recommend Python, Ruby, or Perl because they will be easier to get started with.
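To give a sense of what getting started in Python might look like, here is a minimal sketch of fetching several pages concurrently with the standard library. The names `fetch` and `crawl`, the worker count, and the timeout are assumptions made for illustration. Threads help with the I/O-bound download step; in CPython, CPU-heavy regex matching would need separate processes to use multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(urls, fetch_fn=fetch, workers=8):
    """Fetch all urls concurrently; return {url: body or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every download first so they run in parallel.
        futures = {url: pool.submit(fetch_fn, url) for url in urls}
        for url, fut in futures.items():
            try:
                results[url] = fut.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = exc
    return results
```

Passing the download function in as `fetch_fn` keeps the pool logic separate from the network code, so it can be tried out with a stub before pointing it at a real forum.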