How to best develop web crawlers
Problem description
I am used to creating crawlers to compile information, and when I come across a website with info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.
The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk, or other utilities to clean the page and grab the specific info I need.
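The loop described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual script: the sample page, the "price" markup, and the sed pattern are all hypothetical placeholders, and the page is created locally so the sketch runs without a network.

```shell
#!/bin/sh
set -eu

# A sample page created locally so the sketch is self-contained;
# in practice this step would be:  wget -q -O page.html "$url"
cat > page.html <<'EOF'
<html><body><span class="price">19.99</span></body></html>
EOF

for page in page.html; do
    # sed pulls the text between the (hypothetical) price tags;
    # tr strips stray spaces from the captured value.
    price=$(sed -n 's/.*<span class="price">\([^<]*\)<\/span>.*/\1/p' "$page" | tr -d ' ')
    echo "$page -> $price"
done
```

In a real run the for loop would iterate over a list of URLs and wget would fetch each one before the extraction step.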
The whole process takes some time, depending on the site, and more to download all the pages. And I often run into an AJAX site that complicates everything.
I was wondering if there are better or faster ways to do this, or even some applications or languages that would help with such work.
Answer
Using regular expressions to parse content is a bad idea that has been covered in questions here countless times.
You should be parsing the document into a DOM tree; then you can pull out any hyperlinks, stylesheets, script files, images, or other external links you want and traverse them accordingly.
Many scripting languages have packages for fetching web pages (e.g. curl for PHP) and for parsing HTML (e.g. Beautiful Soup for Python). Go that route instead of the hacky solution of regular expression matching.