How to approach a Google Groups discussions crawler
Problem description
As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions on this group.
I know enough Python and understand basic RSS, but I am stuck: how do I grab all messages between particular dates, or at least all messages between the Nth and Mth most recent?
High-level descriptions and pseudo-code are welcome.
Thank you!
EDIT:
I would like to be able to go back more than 100 messages, but without having to parse 10 messages at a time, as happens when using this URL:
http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N
There must be a better way.
Answer
As Randal mentioned, this violates Google's ToS. However, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page, then use BeautifulSoup to grab all the thread topics (and their links, if you want to crawl deeper). You can then programmatically find the link to the next page of results and make another urllib request to fetch page 2, then repeat the process until there is no next link.
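The loop described above might look roughly like the sketch below, meant only for a site whose terms allow crawling. To keep it dependency-free it swaps in the standard library's HTMLParser for BeautifulSoup; with BeautifulSoup, the LinkCollector class would be replaced by a find_all("a") call. The "/topic/" URL pattern, the "Next" pager label, and the names LinkCollector, fetch, and crawl_topics are all assumptions made for illustration, not anything taken from a real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download one page with urllib and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_topics(start_url, fetch_page=fetch, max_pages=50):
    """Follow 'next page' links from start_url, collecting topic links."""
    topics, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch_page(url))
        parser.close()
        next_url = None
        for href, text in parser.links:
            if text.lower().startswith("next"):   # assumed pager label
                next_url = urljoin(url, href)
            elif "/topic/" in href:               # assumed topic URL pattern
                topics.append((text, urljoin(url, href)))
        url = next_url                            # None ends the loop
    return topics
```

Passing fetch_page as a parameter lets the loop be tried against saved pages before it touches the network, and max_pages plus the seen set keep a pager that links back to itself from looping forever.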
At this point you should have all the raw data; after that, it is just a matter of manipulating the data and implementing your search functionality.
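As a rough illustration of that last step, the sketch below covers the two selections asked about in the question (a date range, and the Nth through Mth most recent) plus a simple keyword search. It assumes each crawled message has already been reduced to a title and a posting date; the Message record and the function names are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Message(NamedTuple):
    title: str
    posted: date   # assumed to have been parsed from the page during the crawl


def between_dates(messages, start, end):
    """Messages posted from start to end (inclusive), newest first."""
    hits = [msg for msg in messages if start <= msg.posted <= end]
    return sorted(hits, key=lambda msg: msg.posted, reverse=True)


def recent_slice(messages, n, m):
    """The n-th through m-th most recent messages (1-based, inclusive)."""
    newest_first = sorted(messages, key=lambda msg: msg.posted, reverse=True)
    return newest_first[n - 1:m]


def search(messages, keyword):
    """Case-insensitive substring match on the title, in input order."""
    needle = keyword.lower()
    return [msg for msg in messages if needle in msg.title.lower()]
```

For example, recent_slice(messages, 101, 200) would return the hundred messages that come after the 100 most recent ones, provided the crawl went back that far.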