Writing a program to scrape forums

Question

I need to write a program to scrape forums.

Should I write the program in Python using the Scrapy framework, or should I use PHP with cURL? Also, is there a PHP equivalent to Scrapy?

Thanks.

Recommended answer

I would choose Python because of its superior libxml2 bindings, specifically lxml.html and pyquery. Scrapy has its own libxml2 bindings; I haven't tested them, and skimming the Scrapy documentation didn't leave me very impressed (I've done a lot of scraping using just these parsers and manual coding). With any of these you get a truly superior HTML parser and querying via XPath, and with lxml.html and pyquery (which is also built on lxml) you also get CSS selectors.
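As a rough illustration of the XPath and CSS-selector querying mentioned above, here is a minimal sketch using lxml.html. The sample markup, the class names (`post`, `author`, `body`), and the helper function names are all made up for this example; a real forum will use different markup. The `cssselect` method also needs the separate `cssselect` package to be installed.

```python
import lxml.html

# Made-up forum markup, used only for illustration.
SAMPLE = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="body">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="body">A reply</p></div>
</body></html>
"""


def posts_via_xpath(html):
    """Return (author, body) pairs using XPath queries."""
    doc = lxml.html.fromstring(html)
    result = []
    for post in doc.xpath('//div[@class="post"]'):
        # string(...) returns the text content of the first matching node.
        author = post.xpath('string(.//span[@class="author"])')
        body = post.xpath('string(.//p[@class="body"])')
        result.append((author.strip(), body.strip()))
    return result


def authors_via_css(html):
    """Return author names using a CSS selector (requires the cssselect package)."""
    doc = lxml.html.fromstring(html)
    return [el.text_content().strip() for el in doc.cssselect("div.post span.author")]
```

Calling `posts_via_xpath(SAMPLE)` walks each post container and pulls out the two fields, while `authors_via_css(SAMPLE)` expresses a similar query in selector syntax; which style reads better is largely a matter of taste.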

If you are doing a small job scraping a forum, I'd skip the scraping framework and just do it by hand: it's easy, and parallelization and the like are not really needed.
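For a sense of what "by hand" might look like, the sketch below walks a paginated thread one page at a time, with no framework and no parallelism. It is only an assumption about how such a loop could be written: the function names, the `div class="post"` and `a rel="next"` markup, the User-Agent string, and the one-second delay are all hypothetical and would need adapting to the actual forum.

```python
import time
import urllib.request
from urllib.parse import urljoin

import lxml.html


def fetch(url):
    """Download one page and return its raw bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "forum-scraper-sketch"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def scrape_thread(start_url, fetch_page=fetch, delay=1.0, max_pages=50):
    """Collect post texts from a thread, following 'next' links sequentially."""
    posts = []
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        doc = lxml.html.fromstring(fetch_page(url))
        # Hypothetical markup: each post body is a <div class="post">.
        for node in doc.xpath('//div[@class="post"]'):
            posts.append(node.text_content().strip())
        # Hypothetical markup: the pager link is <a rel="next" href="...">.
        next_href = doc.xpath('//a[@rel="next"]/@href')
        url = urljoin(url, next_href[0]) if next_href else None
        if url:
            time.sleep(delay)  # pause between requests to go easy on the server
    return posts
```

Passing `fetch_page` as a parameter keeps the loop easy to try against saved pages before pointing it at a live site.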
