Unable to scrape content from a website

Question

I am trying to scrape some content from a website, but the code below is not working (it shows no output). Here is the code:

$url="some url";
$otherHeaders="";   // other headers go here (Content-Type, User-Agent, etc.)
// ... cURL setup to fetch the page (omitted in the original post) ...
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);

$page=new DOMDocument();
$xpath=new DOMXPath($page); 
$content=getXHTML($content);  // a Tidy-based helper that converts bad HTML to XHTML
$page->loadHTML($content);    // fine up to here: echoing $page->saveHTML() displays the page

$path1="//body/table[4]/tbody/tr[3]/td[4]";
$path2="//body/table[4]/tbody/tr[1]/td[4]";

$item1=$xpath->query($path1);
$item2=$xpath->query($path2);

echo $item1->length;      // shows 0
echo $item2->length;      // shows 0

foreach($item1 as $t)
    echo $t->nodeValue;    // prints nothing
foreach($item2 as $p)
    echo $p->nodeValue;    // prints nothing

I am sure there is something wrong with the XPath code above. The XPaths themselves are correct: I have checked them with FirePath (a Firefox add-on). I know I am missing something very silly here, but I can't make out what. Please help.

I have tried similar code for scraping links from Wikipedia (with different XPaths, of course) and it works nicely, so I don't understand why the code above does not work for other URLs. I am cleaning the HTML content with Tidy, so I don't think the problem is that XPath is not getting the HTML right.

I have checked the length of the node list after $item1=$xpath->query($path1), and it is 0. Since the XPaths check out in FirePath, something must be going wrong with $xpath->query.

I have modified my code a bit, as suggested, to use loadXML instead of loadHTML. That gives me the error "Entity 'nbsp' not defined in Entity", so I used the libxml option LIBXML_NOENT to substitute entities, but the errors remain.

Recommended answer

This question reminds me that the solution to a problem often lies in simplicity rather than complication. I was trying namespaces, error corrections, and so on, but the solution only demanded a close inspection of the code. The problem with my code was the order of loadHTML() and the XPath initialization. Initially the order was:

$xpath=new DOMXPath($page);
$page->loadHTML($content);

By doing this I was actually initializing the XPath on an empty document. After reversing the order, first loading the HTML into the DOM and then initializing the XPath, I was able to get the desired results. It was also suggested that I remove the tbody element from the XPaths, because Firefox inserts it automatically into its own DOM even when the page source does not contain it. So the correct XPaths should be:

$path1="//body/table[4]/tr[3]/td[4]";
$path2="//body/table[4]/tr[1]/td[4]";
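
The two fixes above can be sketched together in a small self-contained script. This is only an illustration under stated assumptions: the $content string is made-up sample markup standing in for the fetched page (the real URL and the getXHTML() helper are not reproduced), and the table indexes are adjusted to that sample.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
// The raw HTML contains no <tbody>; browsers add it only in their own DOM.
$content = '<html><body>'
    . '<table>'
    . '<tr><td>a1</td><td>a2</td></tr>'
    . '<tr><td>b1</td><td>b2</td></tr>'
    . '</table>'
    . '</body></html>';

// Load the HTML first ...
$page = new DOMDocument();
$page->loadHTML($content);

// ... and only then create the DOMXPath, so it is bound to the loaded document.
$xpath = new DOMXPath($page);

// A path copied from a browser tool (with tbody) versus the path
// that matches the markup as loadHTML() parsed it (without tbody).
$withTbody    = $xpath->query('//body/table[1]/tbody/tr[2]/td[2]');
$withoutTbody = $xpath->query('//body/table[1]/tr[2]/td[2]');

echo 'with tbody: ', $withTbody->length, " match(es)\n";
echo 'without tbody: ', $withoutTbody->length, " match(es)\n";

foreach ($withoutTbody as $cell) {
    echo $cell->nodeValue, "\n";
}
```

With the sample markup, only the path without tbody should find the cell. If a page's source does contain explicit tbody tags, the path would need to keep them, so it is worth checking the raw HTML rather than the browser's view.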

Thanks to everyone for their suggestions and for bearing with this.
