Anyone know of a good Python-based web crawler that I could use?

Question

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers, but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that; if anyone has advice about any of those tools, I'm open to hearing it. I've used Heritrix via its web interface and found it quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

Recommended answer

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.
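
To give a feel for the "fetch a page, parse out the links, follow them" approach behind the BeautifulSoup + urllib2 suggestion, here is a minimal sketch of a breadth-first crawler. It rests on a few assumptions that are not in the answer: it targets Python 3, so it uses `urllib.request` (the successor to `urllib2`), and it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs. The names `LinkParser`, `extract_links`, `fetch`, and `crawl` are hypothetical helpers made up for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free http(s) links found in html."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def fetch(url):
    """Download and decode one page (assumption: no retries, no robots.txt)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl restricted to the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            # Stay on the same host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("https://example.com/", max_pages=5)` would return the list of pages visited, in order. The sketch ignores robots.txt, rate limiting, and non-HTML content, so for anything beyond a small job a dedicated framework such as Scrapy is probably the better fit.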
