Web Crawler - Python or Perl?


Problem Description



Hi all,
I am currently planning to write my own web crawler. I know Python but
not Perl, and I am interested in knowing which of the two is the
better choice given the following scenario:

1) I/O issues: my biggest resource constraint will be the bandwidth
bottleneck.
2) Efficiency issues: the crawlers have to be fast, robust and as
"memory efficient" as possible. I am running all of my crawlers on
cheap PCs with about 500 MB of RAM and P3 to P4 processors.
3) Compatibility issues: most of these crawlers will run on Unix
(FreeBSD), so there should exist a pretty good compiler that can
optimize my code under these environments.

What are your opinions?

Solution

On Jun 9, 11:48 pm, disappeare...@gmail.com wrote:

Hi all,
I am currently planning to write my own web crawler. I know Python but
not Perl, and I am interested in knowing which of the two is the
better choice given the following scenario:

1) I/O issues: my biggest resource constraint will be the bandwidth
bottleneck.
2) Efficiency issues: the crawlers have to be fast, robust and as
"memory efficient" as possible. I am running all of my crawlers on
cheap PCs with about 500 MB of RAM and P3 to P4 processors.
3) Compatibility issues: most of these crawlers will run on Unix
(FreeBSD), so there should exist a pretty good compiler that can
optimize my code under these environments.

What are your opinions?

It really doesn't matter whether you use Perl or Python for writing
web crawlers; I have used both. The scenarios you mention (I/O,
efficiency, compatibility) don't differ too much between the two
languages. Both have fast I/O. In Python you can use the urllib2
module and/or Beautiful Soup to develop a crawler; in Perl you can use
the Mechanize or LWP modules. Both languages have good support for
regular expressions. I have heard Perl is slightly faster, though I
can't see the difference myself. Both are compatible with *nix. For
writing a good crawler the language is not important; it's the
technique that is important.

regards,
Subeen.
http://love-python.blogspot.com/
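To make the advice above concrete, here is a minimal single-threaded crawler sketch. It deliberately uses only the standard library rather than the urllib2/Beautiful Soup stack named in the reply (urllib2 is the Python 2 name; urllib.request is its Python 3 home), so treat it as an illustrative approximation, not the poster's exact recipe:

```python
# Minimal crawler sketch, stdlib only: fetch a page, pull out the links,
# and walk them breadth-first, visiting each URL at most once.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects absolute link targets from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of absolute URLs linked from an HTML document."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed, max_pages=10):
    """Breadth-first crawl from seed; returns {url: page body}."""
    seen, frontier, pages = {seed}, [seed], {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        with urlopen(url) as resp:  # network I/O dominates the run time
            body = resp.read().decode("utf-8", errors="replace")
        pages[url] = body
        for link in extract_links(body, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler would also honor robots.txt, rate-limit per host, and handle errors; those are omitted here for brevity.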


di***********@gmail.com wrote:

1) I/O issues: my biggest constraint in terms of resource will be
bandwidth throttle neck.
2) Efficiency issues: The crawlers have to be fast, robust and as
"memory efficient" as possible. I am running all of my crawlers on
cheap pcs with about 500 mb RAM and P3 to P4 processors
3) Compatibility issues: Most of these crawlers will run on Unix
(FreeBSD), so there should exist a pretty good compiler that can
optimize my code these under the environments.

You should rethink your requirements. You expect to be I/O bound, so why do
you require a good "compiler"? Especially when asking about two interpreted
languages...
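The I/O-bound point can be made concrete: if the crawler spends its time waiting on the network, overlapping downloads matters far more than raw interpreter speed. Below is a sketch of a thread pool draining a URL queue; `fetch` is a hypothetical stand-in for a real HTTP call (e.g. urllib.request.urlopen) so the structure stays self-contained:

```python
# Sketch: worker threads overlap blocking downloads, which is where an
# I/O-bound crawler actually gains throughput. Blocking I/O releases
# the GIL, so plain threads are enough here.
import queue
import threading


def crawl_concurrently(urls, fetch, num_workers=4):
    """Fetch every URL with a pool of worker threads; returns {url: body}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            body = fetch(url)  # the slow, blocking part
            with lock:  # dict writes guarded for clarity
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a real fetch function, the worker count would be tuned to the available bandwidth rather than to CPU cores.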

Consider using lxml (with Python); it has pretty much everything you need for
a web crawler, supports threaded parsing directly from HTTP URLs, and it's
plenty fast and pretty memory efficient.

http://codespeak.net/lxml/

Stefan


subeen wrote:

can use urllib2 module and/or beautiful soup for developing crawler

Not if you care about a) speed and/or b) memory efficiency.

http://blog.ianbicking.org/2008/03/3...r-performance/

Stefan
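One stdlib route to the memory efficiency raised above: html.parser is incremental, so a page can be fed to the parser in decoded chunks as it arrives instead of being read into memory whole and then handed to Beautiful Soup. A small sketch (the chunking here is illustrative; in practice the chunks would be successive reads from the HTTP response):

```python
# Sketch: html.parser accepts input piecewise via feed(), buffering any
# construct that is split across chunk boundaries, so peak memory stays
# near the chunk size rather than the page size.
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Accumulates the text content of <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def title_from_chunks(chunks):
    """Parse decoded text chunks incrementally; return the page title."""
    parser = TitleParser()
    for chunk in chunks:  # e.g. successive decoded 8 KB response reads
        parser.feed(chunk)
    return parser.title.strip()
```

Note that feed() takes text, so bytes coming off the socket must be decoded per chunk (or via io.TextIOWrapper) before being fed in.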

