HTML table to array PHP

Problem description

I have a school calendar online, but I want to use it in my own application. Unfortunately, I can't get it working with PHP and regex.

The problem is that the table cells are not divided equally, and the layout changes per class. You can find the schedule here and here.

The regex I tried is this:

<td rowspan='(?:[0-9]{1,3})' class='value'>(.+?)<br/>(.+?)<br/>(.+?)<br/><br/><br/></td>

But it does not work correctly!

The end array must look something like this:

[0] => Array
    (
        [0] => maandag //the day
        [1] => 1 //lesson period
        [2] => MEN, 16, dm //content of the cell
    )

I hope this question is clear enough, because I'm not a native English speaker ;)

Solution

Good luck with this one, it's going to be tricky... just 'using an HTML parser' isn't actually going to avoid the major problem, which is the nature of a table that uses rowspans. While it is always good advice to use an HTML parser for parsing large amounts of HTML, if you can break that HTML down into smaller, reliable chunks, then parsing with other techniques can be more efficient (but obviously more prone to subtle unexpected differences in the HTML).
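To illustrate why a parser on its own does not remove the rowspan problem, here is a minimal sketch using PHP's DOMDocument. The two-row table and the variable names are made up for the example (they are not taken from the real schedule page), and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical sample markup, loosely modelled on the cells shown in this answer.
$sampleHtml = <<<HTML
<table>
  <tr><td class='heading'>1</td><td rowspan='2' class='value'>3D<br/>009<br/>Hk</td><td rowspan='1' class='empty'></td></tr>
  <tr><td class='heading'>2</td><td rowspan='1' class='value'>Ne</td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them for sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$parsed = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $span = $td->getAttribute('rowspan'); // empty string when the attribute is absent
        $cells[] = [
            'rowspan' => $span === '' ? 1 : (int) $span,
            'class'   => $td->getAttribute('class'),
            'text'    => trim($td->textContent),
        ];
    }
    $parsed[] = $cells;
}

print_r($parsed);
```

The second row comes back with only two cells, because the cell spanning down from the first row is not repeated in the markup. The parser gives you clean cells and rowspan values, but the column positions still have to be reconstructed.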

Normalise the table

If it were me, I'd start with something that can detect where your table starts and ends (as I wouldn't want to parse the entire page, even with an HTML parser, if I don't need to):

$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');
/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
}
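The same extraction step can be wrapped in a small function and tried against inline markup. This is only a sketch: the function name `extractTableFragment`, its parameters, and the sample page are assumptions made for the example.

```php
<?php
/**
 * Returns the markup between the first <tr> after $marker and the next </table>,
 * or false when any of the three landmarks is missing.
 * The name and signature are illustrative, not part of the original answer.
 */
function extractTableFragment(string $html, string $marker)
{
    $pos = strpos($html, $marker);
    if ($pos === false) {
        return false;
    }
    $start = stripos($html, '<tr>', $pos);
    if ($start === false) {
        return false;
    }
    $end = stripos($html, '</table>', $start);
    if ($end === false) {
        return false;
    }
    return substr($html, $start, $end - $start);
}

// Hypothetical page: a header row containing the marker, then one data row.
$page = "<html><body><table><tr><th>Maandag</th><th>Vrijdag</th></tr>"
      . "<tr><td class='heading'>1</td><td rowspan='2' class='value'>Hk</td></tr>"
      . "</table><p>footer</p></body></html>";

$fragment = extractTableFragment($page, 'Vrijdag');
$missing  = extractTableFragment($page, 'Zondag');

var_dump($fragment);
var_dump($missing);
```

Because the search for `<tr>` starts at the marker, the header row that contains the day names is skipped and only the data rows are returned.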

Then due to the haphazard way the cells are spanned vertically (but seem to be uniform horizontally) I would choose a 'day' column and work downwards.

if ( $table ) {
  /// break apart based on rows
  $rows = preg_split('#</tr>#i', $table);
  ///
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}

The above should give you something like:

array (
  '0' => array (
    '0' => "<td class='heading'>1",
    '1' => "<td rowspan='1' class='empty'>",
    '2' => "<td rowspan='5' class='value'>3D<br/>009<br/>Hk<br/><br/><br/>",
    ...
  ),
  '1' => array (
    '0' => "<td class='heading'>2",
    '1' => "<td rowspan='2' class='empty'>",
    '2' => "<td rowspan='3' class='value'>Hk<br/>",
    ...
  ),
)

Now that you have that, you can scan across each row, and where you preg_match a rowspan, you'd have to create a copy of that cell's information into the row below (in the right place) so as to actually create a complete table structure (without rowspans).

/// can't use foreach here because we want to modify the array within the loop
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  foreach ( $row as $ckey => $cell ) {
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// there was a gotcha here: I realised afterwards I was constructing
        /// a replacement string that looked like '$14$2', which meant
        /// the system tried to find a group at offset 14. To get around this
        /// problem, PHP allows the group reference numbers to be wrapped with {},
        /// so we now get the values of '$1' and '$2' inserted around a literal number
        $newcell = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}'.($rowspan-1).'${2}', $cell);
        /// array_splice takes a length (0 = remove nothing) before the replacement
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

The above should normalise the table so that the rowspans are no longer a problem.
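The group-reference gotcha mentioned in the comments is easy to reproduce in isolation. A minimal sketch, with a made-up cell string and variable names chosen for the example:

```php
<?php
$cell    = "<td rowspan='5' class='value'>3D";
$rowspan = 5;

// Ambiguous: the replacement string becomes '$14$2', so the engine reads
// the first reference as group 14 rather than group 1 followed by a 4.
$broken = preg_replace('/( rowspan=.)[0-9]+(.)/', '$1' . ($rowspan - 1) . '$2', $cell);

// Unambiguous: '${1}4${2}' is group 1, a literal 4, then group 2.
$fixed = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}' . ($rowspan - 1) . '${2}', $cell);

var_dump($broken);
var_dump($fixed);
```

In the ambiguous version the decremented rowspan never makes it into the output, while the braced version yields a cell with `rowspan='4'`.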

(Please note the above is theoretical code; I've typed it by hand and have yet to test it, which I will be doing shortly.)

After testing

There were a few little bugs in the above that I have since fixed, namely getting the arguments for certain PHP functions the wrong way round... After sorting those out, it seems to work:

/// grab the html
$html = file_get_contents('http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');

/// start with nothing
$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');

/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

/// make sure we have a start and end
if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
  /// convert brs to something that won't be removed by strip_tags
  $table = preg_replace('#<br ?/>#i', "\n", $table);
}

if ( $table ) {
  /// break apart based on rows (a close tr is quite reliable to find)
  $rows = preg_split('#</tr>#i', $table);
  /// break apart the cells (a close td is quite reliable to find)
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}
else {
  /// create so we avoid errors
  $rows = array();
}

/// changed this here from a foreach to a for because it seems
/// foreach was working from a copy of $rows and so any modifications
/// we made to $rows while the loop was happening were ignored.
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  /// step each cell in the row
  foreach ( $row as $ckey => $cell ) {
    /// pull out our rowspan value
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      /// if rowspan is greater than one (i.e. spread across multirows)
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// then copy this cell into the next row down, but decrease its rowspan
        /// so that when we find it in the next row we know how many more times
        /// it should span down.
        $newcell = preg_replace('/( rowspan=.)([0-9]+)(.)/', '${1}'.($rowspan-1).'${3}', $cell);
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

/// now finally step through the normalised table and get rid of the unwanted tags
/// that remain, and at the same time split our values into something more useful
foreach ( $rows as $rkey => $row ) {
  foreach ( $row as $ckey => $cell ) {
    $rows[$rkey][$ckey] = preg_split('/\n+/',trim(strip_tags( $cell )));
  }
}

echo '<xmp>';
print_r($rows);
echo '</xmp>';
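
To get from the normalised table to the `[day, period, content]` shape asked for in the question, one more pass is needed. The sketch below repeats the rowspan expansion on already-cleaned cells and then flattens the grid. It rests on assumptions the original answer does not state: the first column holds the lesson period, the remaining columns are days in a known order, and each cell has been reduced to a `text` and a `rowspan` value. The function names and sample data are hypothetical.

```php
<?php
/**
 * Expands rowspans. Each cell is ['text' => string, 'rowspan' => int].
 * A cell spanning n rows is copied into the same column of the next row
 * with rowspan n-1, so it keeps propagating until the span is used up.
 */
function expandRowspans(array $rows): array
{
    $count = count($rows);
    for ($r = 0; $r < $count; $r++) {
        // Read the row by index: it may have gained cells from the row above.
        foreach ($rows[$r] as $c => $cell) {
            if ($cell['rowspan'] > 1 && $r + 1 < $count) {
                $copy = ['text' => $cell['text'], 'rowspan' => $cell['rowspan'] - 1];
                // Wrap $copy in an array so it is inserted as a single element.
                array_splice($rows[$r + 1], $c, 0, [$copy]);
            }
        }
    }
    return $rows;
}

/** Flattens the grid into [day, period, content] entries, skipping empty cells. */
function toLessons(array $grid, array $days): array
{
    $lessons = [];
    foreach ($grid as $row) {
        $period = (int) $row[0]['text']; // assumed: first column is the period
        foreach ($days as $i => $day) {
            $text = $row[$i + 1]['text'] ?? '';
            if ($text !== '') {
                $lessons[] = [$day, $period, $text];
            }
        }
    }
    return $lessons;
}

// Hypothetical two-period, two-day schedule; the first lesson spans two periods.
$rows = [
    [['text' => '1', 'rowspan' => 1], ['text' => 'MEN, 16, dm', 'rowspan' => 2], ['text' => '', 'rowspan' => 1]],
    [['text' => '2', 'rowspan' => 1], ['text' => 'Hk', 'rowspan' => 1]],
];

$lessons = toLessons(expandRowspans($rows), ['maandag', 'dinsdag']);
print_r($lessons);
```

Working on plain arrays rather than raw `<td>` strings keeps the splice logic the same as in the answer while avoiding a second regex pass to decrement the rowspan.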
