Getting price from Amazon with Xpath


Problem description

On the following page:

http://www.amazon.com/Jessica-Simpson-Womens-Double-Breasted/dp/B00K65ZMCA/ref=sr_1_4_mc/185-0705108-6790969?s=apparel&ie=UTF8&qid=1413083859&sr=1-4

I am trying to get the price with the expression

'//span[@id="priceblock_ourprice"]'

but the result is an empty variable.

The interesting part is that on other Amazon pages, like this one:

http://www.amazon.com/SanDisk-Cruzer-Frustration-Free-Packaging--SDCZ36-032G-AFFP/dp/B007JR532M/ref=sr_1_1?s=pc&ie=UTF8&qid=1413084653&sr=1-1&keywords=usb

I do have an expression that works:

'//b[@class="priceLarge"]'

But I don't even know why, because I can't find such a tag in the source of the page. So why does it work? And how do I get the price on the first page? Thanks!

Solution

When scraping with PHP, you cannot take what you see in the browser's source view for granted. The HTML that PHP receives can differ from what the browser shows.

Instead, you first need to fetch the content with PHP and then look into the source there:

$url    = 'http://www.amazon.com/ ... ';
$buffer = file_get_contents($url);

The variable $buffer then contains the HTML that you will be scraping.
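As a minimal sketch of that fetching step, the helper below wraps file_get_contents() so that a failed request does not silently hand false to the parser. The function name fetchHtml, the User-Agent string, and the timeout are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): fetch a page and
// fail loudly instead of passing `false` on to the parser.
function fetchHtml(string $url): string
{
    // The User-Agent value is an arbitrary placeholder. Servers may send
    // different markup depending on the request headers they receive.
    $context = stream_context_create([
        'http' => [
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
            'timeout'    => 10,
        ],
    ]);

    // "@" hides the PHP warning; the failure is reported via the exception.
    $buffer = @file_get_contents($url, false, $context);
    if ($buffer === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $buffer;
}
```

Saving the returned string to a local file and opening that file in an editor is one way to look at exactly the source PHP received, rather than the source the browser shows.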

Doing that with your example links shows that both the first and the second address contain an element with the class priceLarge, which probably holds what you're looking for:

<span class="priceLarge">$168.00</span>
<b class="priceLarge">$14.99</b>

Once you have found where the data you're looking for is, you can create the DOMDocument:

$doc          = new DOMDocument();
$doc->recover = true;
$saved        = libxml_use_internal_errors(true);
$doc->loadHTML($buffer);

You might also be interested in parsing errors:

/** @var array|LibXMLError[] $errors */
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf(
        "%s: (%d) [%' 3d] #%05d:%' -4d %s\n", get_class($error), $error->level, $error->code, $error->line,
        $error->column, rtrim($error->message)
    );
}
libxml_use_internal_errors($saved);

This is how DOMDocument tells you where problems occurred, for example duplicate ID values.
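The error handling can be tried without any network access. The sketch below feeds a small, deliberately sloppy HTML string (an assumption standing in for the fetched page) to DOMDocument and collects whatever libxml reports. Which messages appear depends on the libxml version, so no particular output is claimed here.

```php
<?php
// Inline stand-in for a fetched page: an unclosed <p> and a duplicate ID.
$html = '<div id="a"><p>one</div><div id="a">two</div>';

$doc   = new DOMDocument();
$saved = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$doc->loadHTML($html);

$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}
libxml_clear_errors();              // empty the buffer before the next document
libxml_use_internal_errors($saved); // restore the previous setting

echo $messages ? implode("\n", $messages) . "\n" : "no parse errors reported\n";
```

Calling libxml_clear_errors() matters when several pages are parsed in one run, because otherwise the errors of earlier documents stay in the buffer.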

After loading the buffer into DOMDocument you can create the DOMXPath:

$xp = new DOMXPath($doc);

You will use it to obtain the actual values from the document.
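A short sketch of how such a DOMXPath object is used: the two inline strings below are assumptions that stand in for the two fetched pages and keep only the price elements quoted earlier. Because the expression uses `*` instead of a fixed tag name, it reads the price from both variants.

```php
<?php
// Inline stand-ins for the two fetched pages; only the price elements are kept.
$pages = [
    'first'  => '<div><span class="priceLarge">$168.00</span></div>',
    'second' => '<div><b class="priceLarge">$14.99</b></div>',
];

$prices = [];
foreach ($pages as $name => $html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xp = new DOMXPath($doc);

    // "*" matches any element name, so <span> and <b> are both covered.
    // string() returns the text of the first match, or '' if nothing matches.
    $prices[$name] = $xp->evaluate('string(//*[@class="priceLarge"])');
}

print_r($prices);
```

Note that `@class="priceLarge"` only matches when the attribute value is exactly that string. If an element carries several classes, a test such as `contains(concat(" ", normalize-space(@class), " "), " priceLarge ")` is the usual XPath 1.0 workaround.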

For example, the HTML of those two example addresses has shown that the information you're looking for is in the #priceBlock element, which in both cases contains .listprice and .priceLarge:

$priceBlock = $doc->getElementById('priceBlock');
printf(
    "List Price: %s\nPrice: %s\n"
    , $xp->evaluate('string(.//*[@class="listprice"])', $priceBlock)
    , $xp->evaluate('string(.//*[@class="priceLarge"])', $priceBlock)
);

Which will result in the following output:

List Price: $48.99
Price: $14.99
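The same lookup can be sketched as a self-contained script. The inline HTML is a trimmed assumption modelled on the price block, not the real page. The script also guards against getElementById() returning null, which is what happens when the ID is missing from the HTML that PHP actually received.

```php
<?php
// Trimmed, hand-written stand-in for the price block of the fetched page.
$html = <<<'HTML'
<div id="priceBlock" class="buying">
  <table class="product">
    <tr><td class="priceBlockLabel">List Price:</td>
        <td><span id="listPriceValue" class="listprice">$48.99</span></td></tr>
    <tr><td class="priceBlockLabelPrice">Price:</td>
        <td><b class="priceLarge">$14.99</b></td></tr>
  </table>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$priceBlock = $doc->getElementById('priceBlock');
if ($priceBlock === null) {
    // The element is absent: inspect the fetched HTML before going on.
    throw new RuntimeException('priceBlock not found in the fetched HTML');
}

// The leading "." makes both expressions relative to $priceBlock.
$listPrice = $xp->evaluate('string(.//*[@class="listprice"])', $priceBlock);
$price     = $xp->evaluate('string(.//*[@class="priceLarge"])', $priceBlock);

printf("List Price: %s\nPrice: %s\n", $listPrice, $price);
```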

Obtaining a parent section element into a variable, as with $priceBlock in the example, not only allows you to use relative paths in XPath but also helps with debugging in case you're missing some of the more detailed information:

echo $doc->saveHTML($priceBlock);

This outputs, for example, the whole <div> that contains all the pricing information.

If you set up some helper classes for yourself, you can go further and obtain other information from the document that is useful for scraping, such as all tag/class combinations within the price block:

// you can find StringCollector at the end of the answer
$tagsWithClass = new StringCollector();
foreach ($xp->evaluate('.//*/@class', $priceBlock) as $class) {
    $tagsWithClass->add(sprintf("%s.%s", $class->parentNode->tagName, $class->value));
}
echo $tagsWithClass;

This then outputs the list of collected strings and their counts, here the tag names with their class attribute values:

table.product (1)
td.priceBlockLabel (3)
span.listprice (1)
td.priceBlockLabelPrice (1)
b.priceLarge (1)
tr.youSavePriceRow (1)
td.price (1)

As you can see, this is from the second example URL, because there .priceLarge is on a <b> element.
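If you only need such a count once, a plain array together with array_count_values() can stand in for a helper class. This is a sketch on a trimmed, hand-written stand-in for the price block (an assumption, not the real page). It uses the documented DOMAttr::$ownerElement property to get from the attribute to its element.

```php
<?php
// Trimmed, hand-written stand-in for the price block.
$html = <<<'HTML'
<div id="priceBlock">
  <table class="product">
    <tr><td class="priceBlockLabel">List Price:</td>
        <td><span class="listprice">$48.99</span></td></tr>
    <tr><td class="priceBlockLabel">Price:</td>
        <td><b class="priceLarge">$14.99</b></td></tr>
  </table>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp         = new DOMXPath($doc);
$priceBlock = $doc->getElementById('priceBlock');

$combos = [];
foreach ($xp->evaluate('.//*/@class', $priceBlock) as $class) {
    // $class is a DOMAttr; ownerElement is the element carrying the attribute.
    $combos[] = $class->ownerElement->tagName . '.' . $class->value;
}

// array_count_values() turns the list into "string => number of occurrences".
$counts = array_count_values($combos);
foreach ($counts as $combo => $count) {
    printf("%s (%d)\n", $combo, $count);
}
```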

This is a relatively simple helper. For scraping you can do more, such as displaying the whole HTML structure in the form of a tree:

DomTree::dump($priceBlock);

It gives you the following output, which is easier to read than that of DOMDocument::saveHTML($node) alone:

`<div id="priceBlock" class="buying">
  +"\n\n  "
  `<table class="product">
    +<tr>
    | +<td class="priceBlockLabel">
    | | `"List Price:"
    | +"\n    "
    | +<td>
    | | `<span id="listPriceValue" class="listprice">
    | |   `"$48.99"
    | `"\n  "
    +<tr id="actualPriceRow">
    | +<td id="actualPriceLabel" class="priceBlockLabelPrice">
    | | `"Price:"
    | +"\n    "
    | +<td id="actualPriceContent">
    | | +<span id="actualPriceValue">
    | | | `<b class="priceLarge">
    | | |   `"$14.99"
    | | +"\n    "
    | | `<span id="actualPriceExtraMessaging">
    | |   +"\n        \n\n\n    "
    | |   +<span>
    | |   | `"\n        \n    "
    | |   +"\n    \n\n\n\n\n\n\n\n\n\n    \n\n\n\n\n\n \n\n\n\n\n& "
    | |   +<b>
    | |   | `"FREE Shipping"
    | |   +" on orders over $35.\n\n\n\n"
    | |   +<a href="/gp/help/customer/display.html/ref=mk_sss_dp_1/191-4381493-1931545?ie=UTF8&no...">
    | |   | `"Details"
    | |   `"\n\n\n\n\n\n\n\n\n    \n\n    \n    \n\n\n\n\n\n      \n"
    | `"\n"
    +<tr id="dealPriceRow">
    | +<td id="dealPriceLabel" class="priceBlockLabel">
    | | `"Deal Price: "
    | +"\n  "
    | +<td id="dealPriceContent">
    | | +"\n    "
    | | +<span id="dealPriceValue">
    | | +"\n    "
    | | +<span id="dealPriceExtraMessaging">
    | | `"\n  "
    | `"\n"
    +<script>
    | `[XML_CDATA_SECTION_NODE (4)]
    +<tr id="youSaveRow" class="youSavePriceRow">
    | +<td id="youSaveLabel" class="priceBlockLabel">
    | | `"You Save:"
    | +"\n    "
    | +<td id="youSaveContent" class="price">
    | | +<span id="youSaveValue">
    | | | `"$34.00\n        (69%)"
    | | `"\n    "
    | `"\n  "
    `<tr>
      +<td>
      `<td>
        `<span>
          `"o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o..."

DomTree is referenced in an answer to "Debug a DOMDocument Object in PHP" (https://stackoverflow.com/a/8631974/367456) and in another answer. The code is available on GitHub as a gist.
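To show the idea without the gist, here is a much smaller recursive dumper. It is a hypothetical stand-in written for this illustration, not the DomTree class, and its output format is simpler than the tree shown above: one line per node, indented by depth, with text nodes printed as quoted strings so that whitespace-only nodes stay visible.

```php
<?php
// Hypothetical stand-in for a tree dumper; this is NOT the DomTree class.
function dumpTree(DOMNode $node, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);

    if ($node instanceof DOMElement) {
        $attributes = '';
        foreach ($node->attributes as $attribute) {
            $attributes .= sprintf(' %s="%s"', $attribute->name, $attribute->value);
        }
        $out = sprintf("%s<%s%s>\n", $indent, $node->tagName, $attributes);
        // Recurse into the children, one indentation level deeper.
        foreach ($node->childNodes as $child) {
            $out .= dumpTree($child, $depth + 1);
        }
        return $out;
    }

    // Text, comment and CDATA nodes: print the content as a quoted string.
    $flags = JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE;
    return $indent . json_encode($node->nodeValue, $flags) . "\n";
}

$doc = new DOMDocument();
$doc->loadHTML('<div id="priceBlock"><b class="priceLarge">$14.99</b></div>');
echo dumpTree($doc->getElementById('priceBlock'));
```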


The StringCollector helper class

/**
 * Class StringCollector
 *
 * Collect strings and count them
 */
class StringCollector implements IteratorAggregate
{
    private $array = []; // start empty so iterating works before the first add()

    public function add($string)
    {
        $entry = & $this->array[$string];
        $entry++;
    }

    public function getIterator()
    {
        return new ArrayIterator($this->array);
    }

    public function __toString()
    {
        $buffer = '';
        foreach ($this as $string => $count) {
            $buffer .= sprintf("%s (%d)\n", $string, $count);
        }
        return $buffer;
    }
}
