Problem with multi curl and simplehtmldom, grabbing only header?


Question

I'm using multi curl with simplehtmldom.

I was reading this manual on simplehtmldom: http://simplehtmldom.sourceforge.net/manual_faq.htm#hosting. Its example uses curl to grab one website; I'm trying to grab several, so I'm using multi curl.

But when I use my multi curl script with simplehtmldom, I get an error in the header part of the page. It points to line 39 of simple_html_dom.php:

    $dom->load(call_user_func_array('file_get_contents', $args), true);

which is part of this function:

// get html dom form file
function file_get_html() {
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('file_get_contents', $args), true);
    return $dom;
}

This is my multi curl script.

$urls = array(
   "http://www.html2.com", //$res[0]
   "http://www.html1.com" //$res[1]
   );

$mh = curl_multi_init();

foreach ($urls as $i => $url) {
       $conn[$i]=curl_init($url);
       curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,1);//return data as string 
       curl_setopt($conn[$i],CURLOPT_FOLLOWLOCATION,1);//follow redirects
       curl_setopt($conn[$i],CURLOPT_MAXREDIRS,2);//maximum redirects
       curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,10);//timeout
       curl_multi_add_handle ($mh,$conn[$i]);
}

do { $n=curl_multi_exec($mh,$active); } while ($active);

foreach ($urls as $i => $url) {
       $res[$i]=curl_multi_getcontent($conn[$i]);
       curl_multi_remove_handle($mh,$conn[$i]);
       curl_close($conn[$i]);

}
curl_multi_close($mh);

Then I used this:

$html = file_get_html($res[0]);

Help me please!

Thank you.

Solution

The error you are getting is most likely:

Warning: file_get_contents(): Filename cannot be empty in /tmp/simple_html_dom.php on line 39

That tells you that what you are passing into file_get_html(), namely $res[0], is empty for some reason, most likely because the requests need additional or different cURL options. If you echo $res[$i] in your loop, you'll see that it is empty.
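One way to find out why a body comes back empty is to keep each transfer's error and HTTP status next to its content. The following is a minimal sketch, not a definitive implementation: `fetch_all()` and the keys of the arrays it returns are hypothetical names introduced here, and it assumes PHP 7 or later with the curl extension loaded. It reuses the options from the question and reads the per-transfer result codes with `curl_multi_info_read()`.

```php
<?php
// Hypothetical helper (not part of cURL or simplehtmldom): fetch several URLs
// in parallel and keep the transfer error and HTTP status next to each body.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $conn = [];

    foreach ($urls as $i => $url) {
        $conn[$i] = curl_init($url);
        curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // return data as string
        curl_setopt($conn[$i], CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($conn[$i], CURLOPT_MAXREDIRS, 2);         // maximum redirects
        curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 10);   // connect timeout (seconds)
        curl_multi_add_handle($mh, $conn[$i]);
    }

    // Drive all transfers; wait for socket activity instead of spinning.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0);
        }
    } while ($active && $status === CURLM_OK);

    // Read the result code that cURL recorded for each finished transfer.
    $codes = [];
    while (($info = curl_multi_info_read($mh)) !== false) {
        foreach ($conn as $i => $ch) {
            if ($ch === $info['handle']) {
                $codes[$i] = $info['result'];
            }
        }
    }

    $results = [];
    foreach ($conn as $i => $ch) {
        // Assumption: a transfer with no recorded message is treated as OK.
        $code = $codes[$i] ?? CURLE_OK;
        $results[$i] = [
            'url'    => $urls[$i],
            'body'   => (string) curl_multi_getcontent($ch),
            'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'error'  => $code === CURLE_OK ? '' : (string) curl_strerror($code),
        ];
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Calling `fetch_all($urls)` and printing the `error` and `status` entries for every result whose `body` is an empty string should show whether the request failed outright, hit the redirect limit, or returned an HTTP error.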

Once you fix that, you'll have another problem: you are passing the HTML content you just scraped into file_get_html(), which expects a file path or URL, not content. In fact, file_get_contents() can read from a standard URL when allow_url_fopen is enabled, so you could skip the curl code completely if file_get_contents() is able to pull your data correctly.

If you want to keep the curl calls, pass $res[0] to str_get_html() instead of file_get_html().
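As a rough sketch of that last step, the helper below runs a string-based parser over the fetched bodies and skips the empty ones, so the parser is never handed an empty string. `parse_bodies()` is a hypothetical name introduced for illustration; the parser is passed in as a callable, so the function itself does not depend on simple_html_dom.php being loaded.

```php
<?php
// Hypothetical helper: apply a string-based parser to each fetched body,
// skipping bodies that are missing or contain only whitespace.
function parse_bodies(array $bodies, callable $parse): array
{
    $parsed = [];
    foreach ($bodies as $i => $body) {
        if (!is_string($body) || trim($body) === '') {
            continue; // nothing was downloaded for this URL
        }
        $parsed[$i] = $parse($body);
    }
    return $parsed;
}

// Assuming simple_html_dom.php has been included and $res was filled by the
// multi curl loop from the question:
// $doms = parse_bodies($res, 'str_get_html');
```

The original array keys are preserved, so `$doms[0]` still corresponds to the first URL, and a missing key means that URL returned no content.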
