How to get everything between two HTML tags? (with XPath?)

Views: 131 · Posted: 2021/7/17 18:43:17 · Tags: php, xpath, screen-scraping

Question

EDIT: I've added a solution that works in this case.

I want to extract a table from a page, and I'd like to do this (probably) with DOMDocument and XPath. If you have a better idea, tell me.

My first attempt was this (obviously faulty, because it stops at the first closing table tag):

<?php 
    $tableStart = strpos($source, '<table class="schedule"');
    $tableEnd   = strpos($source, '</table>', $tableStart);
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
?>

I thought this might be solvable with DOMDocument and/or XPath...

In the end I want everything between the tags (in this case, the table tags), and the tags themselves. So all of the HTML, not just the values (e.g. not just 'Value' but 'Value' together with the markup around it). And there is one catch...

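This table has other tables inside it. So if you just search for the end of the table (the closing table tag), you will probably get the wrong tag.
The opening tag has a class you can use to identify it (classname = 'schedule').

Is this possible?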

This is the (simplified) piece of source that I want to extract from another website. I also want to keep the HTML tags, not just the values, so I need the whole table with the class 'schedule':

<table class="schedule">
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- a problematic tag...

    This could even be variable content. =O =S

</table>

Recommended answer

This gets the whole table, but it can be modified to grab another tag. It is quite a case-specific solution that can only be used under specific circumstances. It breaks if HTML, PHP or CSS comments contain the opening or closing tag. Use it with caution.

Function:

// **********************************************************************************
// Gets a whole html tag with its contents.
//  - Source should be a well-formed HTML string (get it with file_get_contents or cURL)
//  - You CAN provide a custom start tag containing e.g. an id or something else (<table style='border:0;')
//    This is recommended if it is not the only p/table/h2/etc. tag in the source.
//  - Skips a closing tag when another opening tag of the same kind comes before it.
function getTagWithContents($source, $tag, $customStartTag = false)
{

    $startTag = '<'.$tag;
    $endTag   = '</'.$tag.'>';

    $startTagLength = strlen($startTag);
    $endTagLength   = strlen($endTag);

//      ***************************** 
    if ($customStartTag)
        $gotStartTag = strpos($source, $customStartTag);
    else
        $gotStartTag = strpos($source, $startTag);

    // Can't find it? (strpos returns false on failure; 0 is a valid position)
    if ($gotStartTag === false)
        return false;
    else
    {

//      ***************************** 

        // This is the hard part: finding the correct closing tag position.
        // <table class="schedule">
        //     <table>
        //     </table> <-- Not this one
        // </table> <-- But this one

        $foundIt          = false;
        $locationInScript = $gotStartTag;
        $startPosition    = $gotStartTag;

        // Checks if there is another opening tag before the closing tag.
        while ($foundIt == false)
        {
            $gotAnotherStart = strpos($source, $startTag, $locationInScript + $startTagLength);
            $endPosition     = strpos($source, $endTag,   $locationInScript + $endTagLength);

            // No closing tag at all: give up instead of returning a bogus slice.
            if ($endPosition === false)
                return false;

            // If it can find another opening tag before the closing tag, skip that closing tag.
            if ($gotAnotherStart !== false && $gotAnotherStart < $endPosition)
            {               
                $locationInScript = $endPosition;
            }
            else
            {
                $foundIt  = true;
                $endPosition = $endPosition + $endTagLength;
            }
        }

//      ***************************** 

        // cut the piece from its source and return it.
        return substr($source, $startPosition, ($endPosition - $startPosition));

    } 
}
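
The loop above skips one closing tag for each opening tag it meets at the first nesting level, so it handles the sibling tables in the question but can stop too early when a nested table itself contains a table. As a hedged alternative (not the answer author's method), the sketch below counts nesting depth instead. The function name `getBalancedTag` is hypothetical, and it assumes `$startMarker` begins exactly at the opening tag. Like the original, it can still be fooled by tags inside comments or scripts.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns the element that starts at $startMarker, including nested
// elements of the same tag name, or false if it cannot be found or balanced.
// Assumption: $startMarker starts with '<' . $tag (e.g. '<table class="schedule"').
function getBalancedTag(string $source, string $tag, string $startMarker)
{
    $start = strpos($source, $startMarker);
    if ($start === false) {
        return false;
    }

    // Match every opening or closing tag of this kind from $start onwards.
    // Offsets reported by PREG_OFFSET_CAPTURE are relative to the whole string.
    $pattern = '~<(/?)' . preg_quote($tag, '~') . '\b[^>]*>~i';
    $flags   = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;
    if (!preg_match_all($pattern, $source, $matches, $flags, $start)) {
        return false;
    }

    $depth = 0;
    foreach ($matches as $m) {
        // $m[1][0] is '/' for a closing tag and '' for an opening tag.
        $depth += ($m[1][0] === '/') ? -1 : 1;

        if ($depth === 0) {
            // This closing tag balances the first opening tag.
            $end = $m[0][1] + strlen($m[0][0]);
            return substr($source, $start, $end - $start);
        }
    }

    return false; // ran out of tags before the element was closed
}
```

It is called the same way as the function above, for example `getBalancedTag($tableData, 'table', '<table class="schedule"')`.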

Usage of the function:

$gotTable = getTagWithContents($tableData, 'table', '<table class="schedule"');
if (!$gotTable)
{
    $error = 'Failed to log in or to get the tag';
}
else
{
    //Do something you want to do with it, e.g. display it or clean it...
    $cleanTable = preg_replace('|href=\'[^\']*\'|', '', $gotTable);
    $cleanTable = preg_replace('|TITLE="[^"]*"|', '', $cleanTable);
}


Above you can find my final solution to my problem. Below is the old solution, from which I made a function for universal use.

Old solution:

// Try to find the table and remember its starting position. Check for success.
// No success means the user is not logged in.
$gotTableStart = strpos($source, '<table class="schedule"');
if ($gotTableStart === false)
{
    $err = 'Can\'t find the table start';
}
else
{

//      ***************************** 
    // This is the hard part: finding the closing tag.
    $foundIt          = false;
    $locationInScript = $gotTableStart;
    $tableStart       = $gotTableStart;

    while ($foundIt == false)
    {
        $innerTablePos = strpos($source, '<table', $locationInScript + 6);
        $tableEnd      = strpos($source, '</table>', $locationInScript + 7);

        // If it can find '<table' before '</table>' skip that closing tag.
        if ($innerTablePos != false && $innerTablePos < $tableEnd)
        {               
            $locationInScript = $tableEnd;
        }
        else
        {
            $foundIt  = true;
            $tableEnd = $tableEnd + 8;
        }
    }

//      ***************************** 

    // Cut the table out of the source; links and popups can be cleaned from it afterwards.
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));

}
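
For comparison, the DOMDocument/XPath route raised in the question can be sketched as follows. This is a minimal sketch under stated assumptions, not the answer author's solution: the function name `extractTableByClass` is hypothetical, `$source` is assumed to be a non-empty HTML string, and `$class` is assumed to contain no quote characters because it is placed directly into the XPath expression. Since the parser builds a tree, nested tables and tags inside comments no longer need special handling.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns the outer HTML of the first <table> carrying the given class,
// or false if the page cannot be parsed or no such table exists.
function extractTableByClass(string $source, string $class)
{
    // Real-world pages are rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc      = new DOMDocument();
    $loaded   = $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    $xpath = new DOMXPath($doc);

    // Match $class as a whole word inside the class attribute,
    // so class="schedule big" matches but class="schedules" does not.
    $query = sprintf(
        '//table[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return false;
    }

    // Passing a node to saveHTML() serialises that node with all descendants.
    return $doc->saveHTML($nodes->item(0));
}
```

A call such as `extractTableByClass($source, 'schedule')` returns the table including its tags. The result is the parser's serialisation of the element, so whitespace and attribute quoting may differ from the original bytes.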
