Get data from an HTML table using preg_match_all in PHP
Question
I have an HTML table like this:
<table ... >
<tbody ... >
<tr ... >
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
</tr>
<tr ... >
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
</td>
<td ...>
string...
</td>
</tr>
..............
</tbody>
</table>
This is a data table, and I need to extract all of its data.
The table has many rows (<tr></tr>), and each row has a fixed number of columns (<td></td>), currently 5.
Note that every table, tr, and td tag may carry attributes (shown as "..." above).
I hope someone can help me write a regex for the preg_match_all function that returns the data like this:
array(
0 => array(
0=> 'some data0',
1=> 'some data1',
2=> 'some data2',
3=> 'some data3',
4=> 'some data4',
),
1 => array(
0=> 'some data0',
1=> 'some data1',
2=> 'some data2',
3=> 'some data3',
4=> 'some data4',
),
2 => array(
0=> 'some data0',
1=> 'some data1',
2=> 'some data2',
3=> 'some data3',
4=> 'some data4',
),
..........
)
Here is an example to test with. Hopefully you can help me!
<table border="1" >
<tbody style="" >
<tr style="" >
<td style="color:blue;">
data0
</td>
<td style="font-size:15px;">
data1
</td>
<td style="font-size:15px;">
data2
</td>
<td style="color:blue;">
data3
</td>
<td style="color:blue;">
data4
</td>
</tr>
<tr style="" >
<td style="color:blue;">
data00
</td>
<td style="font-size:15px;">
data11
</td>
<td style="font-size:15px;">
data22
</td>
<td style="color:blue;">
data33
</td>
<td style="color:blue;">
data44
</td>
</tr>
<tr style="color:black" >
<td style="color:blue;">
data000
</td>
<td style="font-size:15px;">
data111
</td>
<td style="font-size:15px;">
data222
</td>
<td style="color:blue;">
data333
</td>
<td style="color:blue;">
data444
</td>
</tr>
</tbody>
</table>
Recommended answer
You absolutely do NOT want to parse HTML with regex.
For one, there are far too many variations in markup; more importantly, regex copes poorly with the hierarchical nature of HTML. It's best to use an XML parser or, better yet, an HTML-specific parser.
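To make the hierarchy problem concrete, here is a small sketch. The pattern and the two sample strings are my own illustrations, not taken from the question: a naive lazy pattern handles a flat row, but once a cell contains a nested table it stops at the first closing tag and silently drops the rest.

```php
<?php
// Naive pattern (illustrative assumption): capture everything between
// an opening <td ...> and the next </td>.
$pattern = '/<td[^>]*>(.*?)<\/td>/is';

// Flat row: the pattern behaves as hoped.
$flat = '<tr><td style="color:blue;"> data0 </td><td> data1 </td></tr>';
preg_match_all($pattern, $flat, $m);
$flatCells = array_map('trim', $m[1]);

// Row whose single cell contains a nested table: the lazy match ends at
// the inner </td>, so the outer cell is truncated and " tail" is lost.
$nested = '<tr><td>outer <table><tr><td>inner</td></tr></table> tail</td></tr>';
preg_match_all($pattern, $nested, $m);
$nestedCells = array_map('trim', $m[1]);

print_r($flatCells);
print_r($nestedCells);
```

A regex can be patched for any one of these cases, but each patch tends to break on the next variation, which is why a parser is the safer tool.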
Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which parses an HTML document into a traversable PHP object that you can query with jQuery-like selectors.
<?php
require 'simplehtmldom/simple_html_dom.php';
$sHtml = <<<EOS
<table border="1" >
<tbody style="" >
<tr style="" >
<td style="color:blue;">
data0
</td>
<td style="font-size:15px;">
data1
</td>
<td style="font-size:15px;">
data2
</td>
<td style="color:blue;">
data3
</td>
<td style="color:blue;">
data4
</td>
</tr>
<tr style="" >
<td style="color:blue;">
data00
</td>
<td style="font-size:15px;">
data11
</td>
<td style="font-size:15px;">
data22
</td>
<td style="color:blue;">
data33
</td>
<td style="color:blue;">
data44
</td>
</tr>
<tr style="color:black" >
<td style="color:blue;">
data000
</td>
<td style="font-size:15px;">
data111
</td>
<td style="font-size:15px;">
data222
</td>
<td style="color:blue;">
data333
</td>
<td style="color:blue;">
data444
</td>
</tr>
</tbody>
</table>
EOS;
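// Parse the HTML string into a traversable DOM object.
$oHTML = str_get_html($sHtml);
// Select every row of the table.
$oTRs = $oHTML->find('table tr');
$aData = array();
foreach ($oTRs as $oTR) {
    $aRow = array();
    // Collect the trimmed text of each cell in this row.
    $oTDs = $oTR->find('td');
    foreach ($oTDs as $oTD) {
        $aRow[] = trim($oTD->plaintext);
    }
    $aData[] = $aRow;
}
var_dump($aData);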
?>
Output:
array
0 =>
array
0 => string 'data0' (length=5)
1 => string 'data1' (length=5)
2 => string 'data2' (length=5)
3 => string 'data3' (length=5)
4 => string 'data4' (length=5)
1 =>
array
0 => string 'data00' (length=6)
1 => string 'data11' (length=6)
2 => string 'data22' (length=6)
3 => string 'data33' (length=6)
4 => string 'data44' (length=6)
2 =>
array
0 => string 'data000' (length=7)
1 => string 'data111' (length=7)
2 => string 'data222' (length=7)
3 => string 'data333' (length=7)
4 => string 'data444' (length=7)
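If adding a third-party library is not an option, roughly the same result can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a minimal sketch under assumptions, not a drop-in replacement: the helper name tableToArray() is hypothetical, and the shortened sample markup below stands in for the full table.

```php
<?php
// tableToArray() is a hypothetical helper name, not part of any library.
function tableToArray(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $row = [];
        // "./td" selects only the direct cell children of this row.
        foreach ($xpath->query('./td', $tr) as $td) {
            $row[] = trim($td->textContent);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Shortened sample markup in the style of the question's table.
$html = '<table border="1"><tbody style="">'
      . '<tr style=""><td style="color:blue;"> data0 </td><td style="font-size:15px;"> data1 </td></tr>'
      . '<tr style="color:black"><td style="color:blue;"> data00 </td><td> data11 </td></tr>'
      . '</tbody></table>';

$data = tableToArray($html);
print_r($data);
```

Note that loadHTML() does not assume UTF-8 by default, so input containing non-ASCII text may need an explicit encoding declaration before parsing.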