Get products from an e-shop using Simple HTML DOM parser and pagination


Problem description

I want to parse the link, name, and price of some products. Here is my code. I am having trouble parsing because I don't know how to get the product links and names. The price is fine, I already get it. Pagination is not working either.

 <h2>Telefonai Pigu</h2>
</br>
<?php
  include_once('simple_html_dom.php'); 
  $url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";
  // Start from the main page
  $nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
echo "<hr>nextLink: $nextLink<br>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a url
$html->load_file($nextLink);


$phones = $html->find('div#productList span.product');

foreach($phones as $phone) {
    // Get the link
    $linkas = $phone->href;

    // Get the name
    $pavadinimas = $phone->find('a[alt]', 0)->plaintext;

    // Get the name price and extract the useful part using regex
    $kaina = $phone->find('strong[class=nw]', 0)->plaintext;
    // This captures the integer part of decimal numbers: In "123,45" will capture      "123"... Use @([\d,]+),?@ to capture the decimal part too

    echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

  //$query = "insert into telefonai (pavadinimas,kaina,linkas) VALUES (?,?,?)";
//  $this->db->query($query, array($pavadinimas,$kaina, $linkas));
}


// Extract the next link, if not found return NULL
$nextLink = ( ($temp = $html->find('div.pagination a[="rel"]', 0)) ? "https://www.pigu.lt".$temp->href : NULL );

// Clear DOM object
$html->clear();
unset($html);
}
?>

Output:

nextLink: http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26
#----# 999,00 Lt #----#
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26

Recommended answer

Please inspect the source code of the page you are working on carefully; based on that, you can retrieve the nodes you want. It is normal that code written for another website won't work here, since the two websites don't share the same source code or structure!

Let's proceed, again, step by step...

$phones = $html->find('div#productList span.product'); will give you all the "phone containers", or what I called "blocks". Each block has the following structure:

<span class="product">
   <div class="fakeProductContainer">
      <p class="productPhoto">
         <span class="">
         <span class="flags flag-disc-value" title="Akcija"><strong>500<br><span class="currencySymbol">Lt</span></strong></span>
         <span class="flags freeShipping" title="Nemokamas prekių atsiemimas į POST24 paštomatus. Pasiūlymas galioja iki sausio 31 d."></span>
         </span>
         <a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S" class="photo-medium nobr"><img src="http://lt1.pigugroup.eu//colours/48355/16/4835516/c503caf69ad97d889842a5fd5b3ff372_medium.jpg" title="Telefonas Sony Xperia acro S" alt="Telefonas Sony Xperia acro S"></a>
      </p>
      <div class="price">
         <strong class="nw">999,00 Lt</strong>
         <del class="nw">1.499,00 Lt *</del>
      </div>
      <h3><a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S">Sony Xperia acro S</a></h3>
      <p class="descFields">
         3G: <em>HSDPA 14.4 Mbps, HSUPA 5.76 Mbps</em><br>
         GPS: <em>Taip</em><br>
         NFC: <em>Taip</em><br>
         Operacinė sistema: <em>Android OS</em><br>
      </p>
   </div>
</span>

The anchor containing the product link is inside <p class="productPhoto">, and it is the only anchor in there, so to retrieve it simply use $linkas = $phone->find('p.productPhoto a', 0)->href; (then prepend the site's base URL, since it is only a relative link).
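Completing the relative link can be sketched as a small helper. The function name toAbsoluteUrl is hypothetical (it is not part of simple_html_dom), and the sketch assumes every href on the page is either already absolute or relative to the site root:

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turn a site-relative
// href into an absolute URL for a known base such as "http://pigu.lt".
function toAbsoluteUrl(string $base, string $href): string
{
    // Already absolute (http:// or https://): return it unchanged.
    if (preg_match('@^https?://@i', $href)) {
        return $href;
    }
    // Join base and path with exactly one slash between them.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Example with the href from the block shown above.
$linkas = toAbsoluteUrl(
    'http://pigu.lt',
    '/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595'
);
```

Keeping this in one place means the product links and the pagination link are built from the same base, rather than mixing "http://pigu.lt" and "https://www.pigu.lt" by hand.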

The product name is located in the <h3> tag; again, we simply use $pavadinimas = $phone->find('h3 a', 0)->plaintext; to retrieve it.

The price is inside <div class="price"><strong>, and again we use $kaina = $phone->find('div[class=price] strong', 0)->plaintext to retrieve it.

However, not all phones have their price displayed, so we must check whether the price was actually retrieved.
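One way to make that check explicit is a helper that returns null when no price is present. This is only a sketch: parsePrice is a hypothetical name, and it assumes the format seen in the block above, where "." is the thousands separator and "," the decimal separator (as in "1.499,00 Lt *"):

```php
<?php
// Hypothetical helper: convert a price label such as "1.499,00 Lt *" to a float.
// Assumes "." is the thousands separator and "," the decimal separator.
function parsePrice(?string $text): ?float
{
    // No node text at all (for example, find() returned null upstream).
    if ($text === null) {
        return null;
    }
    // Grab the first run of digits, dots and commas that starts with a digit.
    if (!preg_match('@\d[\d.,]*@', $text, $m)) {
        return null;
    }
    // Drop thousands separators, then turn the decimal comma into a dot.
    $normalized = str_replace(['.', ','], ['', '.'], $m[0]);
    return (float) $normalized;
}

$current = parsePrice('999,00 Lt');      // price shown in <strong class="nw">
$old     = parsePrice('1.499,00 Lt *');  // crossed-out price in <del class="nw">
$missing = parsePrice(null);             // product without a displayed price
```

With simple_html_dom, the argument could come from something like `$node ? $node->plaintext : null`, where $node is the result of find(..., 0).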

And finally, the HTML code containing the next link is the following:

<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=3">3</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=4">4</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=5">5</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=6">6</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>      
   </div>
   <div class="pages-info">
      Prekių 
   </div>
</div>

So, we are interested in the <a rel="next"> tag, which can be retrieved using $html->find('div#ListFootPannel a[rel="next"]', 0).
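For comparison, the same lookup can be sketched with PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom. This swaps in a different parser, so treat it as an alternative illustration rather than the method used in this answer; findNextHref is a hypothetical name, and the sample markup is a shortened copy of the pagination block above:

```php
<?php
// Sketch using PHP's built-in DOM extension rather than simple_html_dom.
// Returns the href of the rel="next" anchor, or null when there is none.
function findNextHref(string $html): ?string
{
    // DOMDocument::loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="ListFootPannel"]//a[@rel="next"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

$sample = '<div id="ListFootPannel"><div class="pages-list">'
        . '<strong>1</strong>'
        . '<a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>'
        . '<a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>'
        . '</div></div>';
$next = findNextHref($sample);
```

Returning null when the anchor is missing is what lets a while loop stop on the last page.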

So, if we apply these modifications to your original code, we get:

$url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";

// Start from the main page
$nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
    echo "nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    ////////////////////////////////////////////////
    /// Get phone blocks and extract useful info ///
    ////////////////////////////////////////////////
    $phones = $html->find('div#productList span.product');

    foreach($phones as $phone) {
        // Get the link
        $linkas = "http://pigu.lt" . $phone->find('p.productPhoto a', 0)->href;

        // Get the name
        $pavadinimas = $phone->find('h3 a', 0)->plaintext;

        // If the price is not found, find() returns null; fall back to "000"
        if ( $tempPrice = $phone->find('div[class=price] strong', 0) ) {
            // Get the price text and extract the useful part using a regex
            $kaina = $tempPrice->plaintext;
            // This captures the integer part of decimal numbers: in "123,45" it captures "123". Use @([\d,]+)@ to capture the decimal part too
            $kaina = preg_match('@(\d+),?@', $kaina, $matches) ? $matches[1] : "000";
        }
        else {
            $kaina = "000";
        }


        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

    }
    ////////////////////////////////////////////////
    ////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div#ListFootPannel a[rel="next"]', 0)) ? "http://pigu.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);

    echo "<hr>";
}
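The while loop above trusts the site to eventually stop emitting a rel="next" link. A more defensive variant can be sketched as follows; crawlPages and its $fetchNext callback are hypothetical names introduced for illustration, and the fixed array stands in for real page fetching so the sketch runs without network access:

```php
<?php
// Hypothetical loop helper: follow "next" links until none is left,
// a URL repeats, or $maxPages pages have been visited.
// $fetchNext is supplied by the caller: given a page URL, it processes that
// page and returns the next page's absolute URL, or null on the last page.
function crawlPages(string $startUrl, callable $fetchNext, int $maxPages = 50): array
{
    $visited = [];
    $url = $startUrl;
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $url = $fetchNext($url);
    }
    // URLs in the order they were visited.
    return array_keys($visited);
}

// Stand-in for real fetching: a fixed map of page => next page.
$fakeSite = [
    'http://example.test/list'        => 'http://example.test/list?page=2',
    'http://example.test/list?page=2' => 'http://example.test/list?page=3',
    'http://example.test/list?page=3' => null,
];
$pages = crawlPages('http://example.test/list', fn (string $u) => $fakeSite[$u] ?? null);
```

In real use, the callback would load the page, echo or store the products, and return the completed next link (or null), much as the loop body above does.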

Working demo
