PHP spider breaks in the middle (DOMDocument, XPath, cURL) - help needed


Question

I am a beginner programmer, designing a spider that crawls pages. The logic goes like this:


  • get $url with cURL
  • create a DOM document
  • parse out the href attributes using XPath
  • store the href attributes in $totalurls (those that aren't already there)
  • update $url from $totalurls
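
The steps listed above could be sketched roughly as follows. This is not the asker's code, which was not posted; it is a minimal illustration under stated assumptions. The function names extractLinks, crawl, and fetchWithCurl, the $maxPages cap, and the 10-second timeout are all hypothetical choices. The fetcher is passed in as a callable so that a failed download can be told apart from a page that really has no links.

```php
<?php
// Hypothetical helper: returns the href values of all <a> elements in an HTML string.
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse; avoids loadHTML() complaining about empty input
    }
    $dom = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Hypothetical crawl loop. $fetch is any callable that returns the page body, or false on failure.
// Note: relative hrefs are stored as found; resolving them against the page URL is left out.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $totalurls = [$startUrl];
    for ($i = 0; $i < count($totalurls) && $i < $maxPages; $i++) {
        $html = $fetch($totalurls[$i]);
        if ($html === false) {
            continue; // the fetch failed, so skip rather than report "no links found"
        }
        foreach (extractLinks($html) as $href) {
            if (!in_array($href, $totalurls, true)) {
                $totalurls[] = $href; // store only URLs that aren't already there
            }
        }
    }
    return $totalurls;
}

// One possible fetcher built on cURL (option values are assumptions).
function fetchWithCurl(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    return curl_exec($ch); // string on success, false on failure
}
```

It might be called as crawl('http://example.com/', 'fetchWithCurl'), where the start URL is a placeholder.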

The problem is that after the 10th crawled page the spider says it does not find ANY links on the page, none on the next one either, and so on.

But if I begin with the page that was 10th in the previous example, it finds all the links with no problem, then breaks again after 10 URLs have been crawled.

Any idea what might cause this? My guess is something with DOMDocument, maybe; I am not 100% familiar with it. Or can storing too much data cause trouble? It could be a real beginner issue, because I am brand new and clueless. Please give me some advice on where to look for the problem.

Recommended answer

My guess is that your script is timing out. PHP's default max_execution_time is 30 seconds (some hosts set it to 60). You can override it with set_time_limit($num_of_seconds); in the script, by changing max_execution_time in your php.ini, or, if you are on a hosting service, through its PHP settings panel (or whatever it is called).
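
As a small sketch of the first option: the snippet below prints the limit currently in effect and then raises it. The 300-second value is an arbitrary example, not a recommendation.

```php
<?php
// Show the limit currently in effect. 0 means no limit, which is the default on the command line.
echo 'max_execution_time before: ', ini_get('max_execution_time'), PHP_EOL;

// Restart the timer and allow up to 300 more seconds (example value).
// The php.ini equivalent would be the line: max_execution_time = 300
set_time_limit(300);

echo 'max_execution_time after: ', ini_get('max_execution_time'), PHP_EOL;
```

If the limit shown before the call is already 0, a timeout is unlikely to be the cause and the error output becomes the better place to look.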

Also you might want to add this to the top of your page:

error_reporting(E_ALL);
ini_set("display_errors", 1);

and check your error logs to see if there are messages that pertain to your spider.
