Use curl and xpath to crawl website

Question

I want to crawl this site and get the standings table at http://www.basketligaen.dk/da/top/turnering/stilling/, but when I try to get the content I get DOMNodeList Object ( [length] => 0 ). My code looks like this:

    $curl = curl_init('http://www.basketligaen.dk/da/top/turnering/stilling/');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
    $html = curl_exec($curl);
    curl_close($curl);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("//div[@id='3739']/table");
    print_r($elements);

I have crawled a lot of pages before, but I can't find the problem with this one. Can anyone see what I am doing wrong?

Recommended answer

There is no table element directly under the div element with id="3739".

The table is under the div element with id="3738", and not as a direct child. This should work:

    //div[@id='3738']//table

Note the double slash: it selects descendants at any depth, whereas a single slash selects only direct children.
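To make the difference concrete, here is a minimal sketch that runs both expressions against a small inline document. The markup is a hypothetical stand-in for the real page (which is not reproduced here); it only assumes that the table sits a couple of levels below the div.

```php
<?php
// Hypothetical markup: the table is nested two levels below the div.
$html = '<html><body><div id="3738"><div class="inner">'
      . '<table><tr><td>Team A</td></tr></table>'
      . '</div></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "/" only matches tables that are direct children of the div.
$direct = $xpath->query("//div[@id='3738']/table");

// "//" matches tables anywhere below the div.
$nested = $xpath->query("//div[@id='3738']//table");

echo $direct->length, "\n"; // 0
echo $nested->length, "\n"; // 1
```

The first query reproduces the empty DOMNodeList from the question; the second finds the table.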

As a side note, I don't particularly like the readability or the robustness of the current XPath expression: the 3738 id is rather cryptic, carries no data-oriented information, and is likely to change. A better way would probably be to rely on the table header:

    //div[. = 'Grundspil']/following-sibling::table
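As a rough illustration of the header-based expression, the sketch below applies it to a hypothetical fragment in which a div containing exactly the text "Grundspil" is immediately followed by the table as a sibling. The real page layout may differ, so treat the markup and the cell values as assumptions.

```php
<?php
// Hypothetical markup: a header div followed by the table as its sibling.
$html = '<html><body><div class="wrap">'
      . '<div>Grundspil</div>'
      . '<table><tr><td>Team A</td><td>10</td></tr></table>'
      . '</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the div whose text is exactly "Grundspil", then the table after it.
$tables = $xpath->query("//div[. = 'Grundspil']/following-sibling::table");

$cells = [];
if ($tables !== false && $tables->length > 0) {
    // Query relative to the matched table to collect its cell texts.
    foreach ($xpath->query('.//td', $tables->item(0)) as $td) {
        $cells[] = trim($td->textContent);
    }
}

echo implode(' | ', $cells), "\n"; // Team A | 10
```

The comparison `. = 'Grundspil'` is an exact match on the element's text. If the header carries surrounding whitespace, `normalize-space(.) = 'Grundspil'` is a common variant.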


All that said, there is a bigger problem here: the table is part of a JavaScript "widget" and is configured and loaded dynamically by your browser and its JavaScript engine. When you download the page with curl, you only get the initial HTML page, which does not contain the desired table.
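One way to confirm this before tuning the XPath any further is to check whether the fetched markup contains a table at all. The sketch below is only a diagnostic aid: `countMatches` is a hypothetical helper, and `$initialHtml` is an assumed stand-in for what curl might return, not the site's actual response.

```php
<?php
// Hypothetical helper: how many nodes does the expression match in this HTML?
function countMatches(string $html, string $expression): int
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($expression);
    return $nodes === false ? 0 : $nodes->length;
}

// Assumed stand-in for the curl response: an empty placeholder div
// that a script would fill in later inside a real browser.
$initialHtml = '<html><body><div id="3738"></div>'
             . '<script src="widget.js"></script></body></html>';

echo countMatches($initialHtml, '//table'), "\n";           // 0
echo countMatches($initialHtml, "//div[@id='3738']"), "\n"; // 1
```

If `//table` matches nothing in the raw response while the browser shows a table, the content is being added client-side and no XPath change will find it in the curl output.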

One of the easiest workarounds (in terms of implementation) would be to automate a real browser via, for example, selenium. The points about the XPath expressions made above would still apply since, among others, there is also the "by xpath" locator.
