How to approach a Google Groups discussions crawler


Problem description

As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions on this group:

comp.unix.shell

I know enough Python and understand basic RSS, but I am stuck on ... how do I grab all messages between particular dates, or at least all messages between Nth recent and Mth recent?

High-level descriptions and pseudo-code are welcome.

Thank you!

EDIT:

I would like to be able to go back more than 100 messages, but I do not want to do it by parsing 10 messages at a time, as with this URL:

http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N

There must be a better way.

Solution

As Randal mentioned, this violates Google's ToS. However, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page, then use BeautifulSoup to grab all the thread topics (and their links, if you want to crawl deeper). You can then programmatically find the link to the next page of results and make another urllib request to fetch page 2, then repeat the process.
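The paging loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a working Google Groups client: the CSS classes `topic` and `next` are hypothetical markers for thread links and the pager link, and the names `TopicPageParser`, `fetch`, and `crawl` are made up for this example. To keep it runnable with no third-party packages, it uses the standard library's `html.parser` in place of BeautifulSoup; the structure would be the same with either.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TopicPageParser(HTMLParser):
    """Collect (title, url) pairs for topic links, plus the 'next page' URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.topics = []      # list of (title, absolute_url)
        self.next_url = None  # absolute URL of the next results page, if any
        self._href = None     # href of the <a> element currently open
        self._css = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._href = attrs.get("href")
            self._css = attrs.get("class") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        url = urljoin(self.base_url, self._href)
        classes = self._css.split()
        if "topic" in classes:    # assumption: thread links carry class="topic"
            self.topics.append(("".join(self._text).strip(), url))
        elif "next" in classes:   # assumption: the pager link carries class="next"
            self.next_url = url
        self._href = None


def fetch(url):
    """Download a page with urllib and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=50):
    """Follow 'next' links from start_url and return every topic found."""
    topics, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = TopicPageParser(url)
        parser.feed(fetch(url))
        parser.close()
        topics.extend(parser.topics)
        url = parser.next_url
    return topics
```

Passing `fetch` as a parameter makes it possible to try the loop against saved HTML without touching the network, and `max_pages` together with the `seen` set keeps a pager that links back to itself from looping forever.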

At this point you should have all the raw data; after that, it is just a matter of manipulating the data and implementing your search functionality.
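As one possible shape for that last step, the two selections asked about in the question (messages between two dates, and the Nth through Mth most recent) reduce to a filter and a slice once the data is in memory. The sketch below assumes each scraped record carries a posting date; the `Message` type and both function names are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Message(NamedTuple):
    posted: date   # assumption: the crawler recorded a posting date per message
    subject: str


def between_dates(messages, start, end):
    """Messages posted from start to end inclusive, newest first."""
    hits = [m for m in messages if start <= m.posted <= end]
    return sorted(hits, key=lambda m: m.posted, reverse=True)


def recent_slice(messages, n, m):
    """The n-th through m-th most recent messages (1-based, inclusive)."""
    newest_first = sorted(messages, key=lambda msg: msg.posted, reverse=True)
    return newest_first[n - 1:m]
```

For example, `recent_slice(messages, 101, 200)` would return the second hundred most recent messages, provided the crawl went back far enough to collect them.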
