Simple HTML DOM spaces into class


Question


I'm using Simple HTML DOM to get elements from a website, but when class attribute has spaces, I don't get anything.

Source HTML from betexplorer.com

<table id="table-type-2" class="stats-table stats-main table-2">
    <tbody>
    <tr class="odd glib-participant-ppjDR086" data-def-order="0">
        <td class="rank col_rank no" title="">1.</td>
        <td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=ppjDR086');">Manchester United</a></span></td>
        <td class="matches_played col_matches_played">4</td>
        <td class="wins col_wins">4</td>
        <td class="draws col_draws">0</td>
        <td class="losses col_losses">0</td>
        <td class="goals col_goals">14:0</td>
        <td class="goals col_goals">12</td>
    </tr>
    <tr class="even glib-participant-hA1Zm19f" data-def-order="1">
        <td class="rank col_rank no" title="">2.</td>
        <td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=hA1Zm19f');">Arsenal</a></span></td>
        <td class="matches_played col_matches_played">4</td>
        <td class="wins col_wins">4</td>
        <td class="draws col_draws">0</td>
        <td class="losses col_losses">0</td>
        <td class="goals col_goals">11:3</td>
        <td class="goals col_goals">12</td>
    </tr>
    <tr class="odd glib-participant-Wtn9Stg0" data-def-order="2">
        <td class="rank col_rank no" title="">3.</td>
        <td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=Wtn9Stg0');">Manchester City</a></span></td>
        <td class="matches_played col_matches_played">4</td>
        <td class="wins col_wins">3</td>
        <td class="draws col_draws">1</td>
        <td class="losses col_losses">0</td>
        <td class="goals col_goals">18:3</td>
        <td class="goals col_goals">10</td>
    </tr>
    </tbody>
</table>

My PHP code using SimpleHtmlDom

    <?php
include('../simple_html_dom.php');


function getHTML($url,$timeout)
{
       $ch = curl_init($url); // initialize curl with given url
       curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set  useragent
       curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
       curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
       curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
       curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
       curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
       curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
       return @curl_exec($ch);
}



$response=getHTML("http://www.betexplorer.com/soccer/england/premier-league/standings/?table=table&table_sub=home&ts=WOO1nDO2&dcheck=0",10);
$html = str_get_html($response);

$team = $html->find("span[class=team_name_span]/a"); 
$numbermatch = $html->find("td.matches_played.col_matches_played"); 
$wins = $html->find("td.wins.col_wins"); 
$draws = $html->find("td.draws.col_draws"); 
$losses = $html->find("td.losses.col_losses"); 
$goals = $html->find("td.goals.col_goals"); 

?>

<table border="1" width="100%">
    <thead>
        <tr>
            <th>Team</th>
            <th>MP</th>
            <th>W</th>
            <th>D</th>
            <th>L</th>
            <th>G</th>
        </tr>
    </thead>

<?php



foreach ($team as $match) {


echo  "<tr>".

            "<td class='first-cell'>".$match->innertext."</td> "  .
            "<td class='first-cell'>".$numbermatch->innertext."</td> "  .
            "<td class='first-cell'>".$wins->innertext."</td> "  .
            "<td class='first-cell'>".$draws->innertext."</td> "  .
            "<td class='first-cell'>".$losses->innertext."</td> "  .
            "<td class='first-cell'>".$goals->innertext."</td> "  .


            "</tr><br/>";



        }       



?>
</table>

So I only get the first value (because that class name has no spaces), but I can't get the rest of the values.

EDIT: I fixed a mistake in the PHP code. Please look again.

EDIT2: It's not a duplicate. I tried that solution, but it doesn't work.

EDIT3: I tried to use advanced_html_dom (it should fix the spaces problem), but I don't get anything (not even the one value I was getting before).

EDIT4: In the screens below you can see what I'd like to get and what I get right now:

EDIT5

team.php

    <?php

// START team.php 
class Team
{
    public $name, $matches, $wins, $draws, $losses, $goals;

    public static function parseRow($row): ?self
    {
        $result = new self();
        $result->name = $result->parseMatch($row, 'span.team_name_span a');
        if (null === $result->name) {
            return null; // couldn't even match the name, probably not a team row, skip it
        }

        $result->matches = $result->parseMatch($row, 'td.col_matches_played');
        $result->wins = $result->parseMatch($row, 'td.col_wins');
        $result->draws = $result->parseMatch($row, 'td.col_draws');
        $result->losses = $result->parseMatch($row, 'td.col_losses');
        $result->goals = $result->parseMatch($row, 'td.col_goals');

        return $result;
    }

    private function parseMatch($row, $selector)
    {
        if (!empty($match = $row->find($selector, 0))) {
            return $match->innertext;
        }

        return null;
    }
}

// END team.php

?>

clas.php

    <?php

include('../simple_html_dom.php');
include('../team.php');


function getHTML($url,$timeout)
{
       $ch = curl_init($url); // initialize curl with given url
       curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set  useragent
       curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
       curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
       curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
       curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
       curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
       curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
       return @curl_exec($ch);
}



$response=getHTML("http://www.betexplorer.com/soccer/england/premier-league/standings/?table=table&table_sub=home&ts=WOO1nDO2&dcheck=0",10);
$html = str_get_html($response);



// START DOM parsing block
$teams = [];

foreach($html->find('table.stats-table tr') as $row) {
    $team = Team::parseRow($row); // load the row into a Team object if possible

    // skip this entry if it couldn't match the row
    if (null !== $team) {
        // what we're actually doing here is just the OOP equivalent of:
        // $teams[] = ['name' => $row->find('span.team_name_span a',0)->innertext, ...];
        $teams[] = $team;
    }
}

foreach($teams as $team) {
    echo $team->name;
    echo $team->matches;
}

// END DOM Parsing Block

?>

Answer

Solution: http://phpfiddle.org/main/code/cq54-hta2

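Class names don't have spaces, so don't try to match them

SimpleHtmlDom doesn't support attribute selectors like this. Plus, you're trying to match a class as though the class name itself contained spaces. The class attribute is a space-separated list of class names, so class="wins col_wins" means the element has two classes, wins and col_wins. So, instead of this: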

$wins = $html->find("td[class=wins col_wins]"); 
$draws = $html->find("td[class=draws col_draws]"); 
$losses = $html->find("td[class=losses col_losses]"); 

Do the following to match td elements that have BOTH class names:

$wins = $html->find("td.wins.col_wins"); 
$draws = $html->find("td.draws.col_draws"); 
$losses = $html->find("td.losses.col_losses"); 

Additionally, that HTML markup doesn't require you to match both classes to get the data, so you could simply do:

$wins = $html->find("td.col_wins"); 
$draws = $html->find("td.col_draws"); 
$losses = $html->find("td.col_losses"); 
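
To see why the dotted form works, it can help to model what a class selector checks. The following is a minimal sketch, not Simple HTML DOM's actual implementation: hasAllClasses is a hypothetical helper that splits the class attribute on whitespace and tests whether every required name is present as a token.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns true when every
// required class name appears as a token in the element's class attribute.
function hasAllClasses(string $classAttribute, array $required): bool
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty tokens.
    $tokens = preg_split('/\s+/', $classAttribute, -1, PREG_SPLIT_NO_EMPTY);

    // array_diff returns the required names that are missing from $tokens.
    return array_diff($required, $tokens) === [];
}

// Models td.wins.col_wins against class="wins col_wins".
var_dump(hasAllClasses('wins col_wins', ['wins', 'col_wins']));

// Models td.col_wins: one class is enough to identify the cell.
var_dump(hasAllClasses('wins col_wins', ['col_wins']));

// Treating "wins col_wins" as a single class name never matches a token.
var_dump(hasAllClasses('wins col_wins', ['wins col_wins']));
```

The last call mirrors the original mistake: no single token ever equals the whole attribute string, because the spaces separate the tokens.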

Getting repeated selectors (looping through rows).

What you are trying to extract is an array of data from the rows of a table. More specifically, something that looks like this:

$teams = [
    ['Arsenal', matches, wins, ...],
    ['Liverpool', matches, wins, ...],
    ...
];

This means you'll need to run the same data-extraction against each row of the table. SimpleHtmlDom makes this easy through jQuery-like find methods, which can be called from any matched element.
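
The per-row idea can be tried without the library. The sketch below uses PHP's built-in DOMDocument and DOMXPath instead of SimpleHtmlDom, purely so that it runs on its own. The markup is a trimmed copy of the question's table, and classToken is a hypothetical helper that builds the usual XPath 1.0 test for "the class attribute contains this token".

```php
<?php
// Trimmed copy of the standings markup from the question (two rows only).
$markup = <<<'HTML'
<table class="stats-table stats-main table-2">
  <tr class="odd">
    <td class="participant_name col_participant_name col_name"><span class="team_name_span"><a>Manchester United</a></span></td>
    <td class="matches_played col_matches_played">4</td>
    <td class="wins col_wins">4</td>
  </tr>
  <tr class="even">
    <td class="participant_name col_participant_name col_name"><span class="team_name_span"><a>Arsenal</a></span></td>
    <td class="matches_played col_matches_played">4</td>
    <td class="wins col_wins">4</td>
  </tr>
</table>
HTML;

// Hypothetical helper: XPath 1.0 test for "class attribute contains this token".
function classToken(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$teams = [];
foreach ($xpath->query('//table[' . classToken('stats-table') . ']//tr') as $row) {
    // Passing $row as the context node scopes each query to the current row.
    $name = $xpath->query('.//span[' . classToken('team_name_span') . ']/a', $row)->item(0);
    if ($name === null) {
        continue; // no team name found, so this is not a team row
    }

    $matches = $xpath->query('.//td[' . classToken('col_matches_played') . ']', $row)->item(0);
    $wins    = $xpath->query('.//td[' . classToken('col_wins') . ']', $row)->item(0);

    $teams[] = [
        'name'    => trim($name->textContent),
        'matches' => trim($matches->textContent),
        'wins'    => trim($wins->textContent),
    ];
}

print_r($teams);
```

Each query runs relative to the current row, which is the same role $row->find([selector]) plays in SimpleHtmlDom: values from one team can never be paired with another team's name.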

Complete Solution

This solution defines a Team class to load each row's data into, which should make future adjustments much simpler.

The important piece to note here is that we first loop through every table row as $row, and then collect the team name and numbers from $row->find([selector]).

// START team.php 
class Team
{
    public $name, $matches, $wins, $draws, $losses, $goals;

    public function __construct($row)
    {
        $this->name = $this->parseMatch($row, 'span.team_name_span a');
        if (null === $this->name) {
            return; // couldn't even match the name, probably not a team row, skip it
        }

        $this->matches = $this->parseMatch($row, 'td.col_matches_played');
        $this->wins = $this->parseMatch($row, 'td.col_wins');
        $this->draws = $this->parseMatch($row, 'td.col_draws');
        $this->losses = $this->parseMatch($row, 'td.col_losses');
        $this->goals = $this->parseMatch($row, 'td.col_goals');
    }

    private function parseMatch($row, $selector)
    {
        if (!empty($match = $row->find($selector, 0))) {
            return $match->innertext;
        }

        return null;
    }

    public function isValid()
    {
        return null !== $this->name;
    }

    public function getMatchData() //example
    {
        return "<br><b>". $this->wins .' : '. $this->matches . "</b>";
    }
}

// END team.php

// START DOM parsing block
$teams = [];

foreach($html->find('table.stats-table tr') as $row) {
    $team = new Team($row); // load the row into a Team object if possible

    // skip this entry if it couldn't match the row
    if ($team->isValid()) {
        // what we're actually doing here is just the OOP equivalent of:
        // $teams[] = ['name' => $row->find('span.team_name_span a',0)->innertext, ...];
        $teams[] = $team;
    }
}

foreach($teams as $team) {
    echo "<h1>".$team->name."</h1>";
    echo $team->losses;
    echo $team->getMatchData();
}

// END DOM Parsing Block
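
The question originally wanted the values in an HTML table (Team, MP, W, D, L, G). One possible way to get there from the parsed teams is sketched below. renderTeamsTable is a hypothetical helper, and the sample object is a stand-in for a parsed Team (any object with the same public properties would do), filled with the first row of the question's HTML.

```php
<?php
// Hypothetical helper: builds the table the question was aiming for from
// objects exposing the same public properties as Team.
function renderTeamsTable(array $teams): string
{
    $html = "<table border=\"1\" width=\"100%\">\n"
          . "<thead><tr><th>Team</th><th>MP</th><th>W</th><th>D</th><th>L</th><th>G</th></tr></thead>\n"
          . "<tbody>\n";

    foreach ($teams as $team) {
        $cells = [$team->name, $team->matches, $team->wins, $team->draws, $team->losses, $team->goals];

        $html .= '<tr>';
        foreach ($cells as $cell) {
            // Escape scraped text before echoing it into our own page.
            $html .= '<td>' . htmlspecialchars((string) $cell) . '</td>';
        }
        $html .= "</tr>\n";
    }

    return $html . "</tbody>\n</table>\n";
}

// Stand-in for a parsed Team, using the first row from the question's HTML.
$sample = (object) [
    'name' => 'Manchester United', 'matches' => '4', 'wins' => '4',
    'draws' => '0', 'losses' => '0', 'goals' => '14:0',
];

echo renderTeamsTable([$sample]);
```

Because every cell of a row comes from the same object, the columns stay aligned with the team name, which the original loop over $team alone could not guarantee.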
