Unable to scrape content from a website


Problem Description

I am trying to scrape some content from a website, but the code below is not working (it shows no output). Here is the code:

$url="some url";
$otherHeaders="";   //here i am using some other headers like content-type,userAgent,etc
some curl to get the webpage
...
..
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);curl_close($ch);

$page=new DOMDocument();
$xpath=new DOMXPath($page); 
$content=getXHTML($content);  //this is a tidy function to convert bad html to xhtml 
$page->loadHTML($content);    // its okay till here when i echo $page->saveHTML the page is displayed

$path1="//body/table[4]/tbody/tr[3]/td[4]";
$path2="//body/table[4]/tbody/tr[1]/td[4]";

$item1=$xpath->query($path1);
$item2=$xpath->query($path2);

echo $item1->length;      //this shows zero 
echo $item2->length;      //this shows zero

foreach($item1 as $t)
echo $t->nodeValue;    //doesnt show anything
foreach($item2 as $p)
echo $p->nodeValue;    //doesnt show anything

I am sure something is wrong with the XPath code above, yet the XPath expressions themselves are correct: I have verified them with FirePath (a Firefox add-on). I know I am missing something very silly here, but I cannot work it out. Please help. I have used similar code to scrape links from Wikipedia (with different XPath expressions, of course) and it works nicely, so I do not understand why the code above fails for other URLs. I am cleaning the HTML content with Tidy, so there should not be a problem with XPath receiving malformed HTML, right? The length of the node list returned by $item1 = $xpath->query($path1) is 0, which means something is going wrong in $xpath->query(), even though the expressions check out in FirePath.

As pointed out, I have modified my code a bit and used loadXML instead of loadHTML, but this gives the error "Entity 'nbsp' not defined". I then used the libxml option LIBXML_NOENT to substitute entities, but the errors remain.
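A side note on that entity error: DOMDocument::loadHTML() uses libxml's HTML parser, which already understands named HTML entities such as &nbsp;, whereas loadXML() only knows the predefined XML entities, which is why loadXML() complains here. Below is a minimal sketch, assuming $content holds the Tidy-cleaned markup from above; libxml_use_internal_errors() merely keeps parser warnings from being echoed.

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$page = new DOMDocument();
$page->loadHTML($content);          // loadHTML(), unlike loadXML(), accepts &nbsp; and friends
libxml_clear_errors();              // discard any collected warnings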

Recommended Answer

This question reminds me that, a lot of the time, the solution to a problem lies in simplicity rather than complication. I was trying namespaces, error correction, and so on, but the solution only demanded a close inspection of the code. The problem with my code was the order of loadHTML() and the XPath initialization. Initially the order was:

$xpath = new DOMXPath($page);
$page->loadHTML($content);

By doing this I was actually initializing the XPath object on an empty document. Reversing the order, first loading the DOM with the HTML and then initializing the XPath, I was able to get the desired results. Also, as suggested, the tbody element should be removed from the XPath expressions, because Firefox inserts it automatically (it is not present in the source HTML). So the correct XPath expressions are:

$path1="//body/table[4]/tr[3]/td[4]";
$path2="//body/table[4]/tr[1]/td[4]";

Thanks to everyone for their suggestions and for bearing with this.
