Extracting the body text of an HTML document using PHP

Views: 129  Published: 2020/6/18 19:18:11  Tags: php regex text text-processing html-content-extraction

Question

I know it's better to use DOM for this purpose, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

if (empty($matches))
    exit;

$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];

$index_of_body_end_tag = strpos($html, '</body>');

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Thank you very much to all of you. There was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
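
As a minimal sketch, the corrected length calculation can be wrapped in a small function. The name extract_body is hypothetical and not from the original post; the sketch also assumes it is acceptable to search for the closing tag only after the opening tag and to return null when either tag is missing.

```php
<?php
// Hypothetical helper (not from the original post): returns the text between
// the opening <body ...> tag and the first </body> after it, or null if
// either tag is missing.
function extract_body(string $html): ?string
{
    // Locate the opening tag and record its byte offset.
    if (!preg_match('/<body[^>]*>/i', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start_tag = $matches[0][0];
    $content_start = $matches[0][1] + strlen($start_tag);

    // Search for the closing tag only after the opening tag.
    $end = stripos($html, '</body>', $content_start);
    if ($end === false) {
        return null;
    }

    // Length = offset of </body> minus the offset where the content starts.
    return substr($html, $content_start, $end - $content_start);
}

$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";
echo trim(extract_body($html)), "\n";
```

Computing the content start once and subtracting it from the end offset avoids the operator-grouping mistake, because the tag length is never added back separately.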

Recommended answer

The problem is that your string contains newlines, and by default . in the pattern does not match newline characters. You need to add the /s modifier so that . matches across lines.

Here is my solution; I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

// Get everything between <body> and </body>; the opening tag may carry any number of attributes.
if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
    $body = $matches[1];
}
// Output all matches for debugging purposes.
var_dump($matches);
?>
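
To see what the s modifier in the pattern above changes, here is a small sketch that runs the same kind of pattern with and without it. The sample string and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Illustrative input: the body content is separated from the tags by newlines.
$html = "<body class=\"x\">\n<p>Some text</p>\n</body>";

// Without /s, "." does not match a newline, so (.*) cannot reach </body>.
$without_s = preg_match('/<body[^>]*>(.*)<\/body>/i', $html, $m1);

// With /s, "." also matches newlines and the capture spans several lines.
$with_s = preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $m2);

var_dump($without_s); // int(0): no match
var_dump($with_s);    // int(1): match
var_dump($m2[1]);     // the captured body content, including the newlines
```

The U modifier used in the answer additionally makes the quantifiers ungreedy, so the capture stops at the first </body> rather than the last one.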

I am updating my answer to give you a better explanation of why your code fails.

You have the following string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems fine with it, but every line break is a non-printing character that still counts toward string offsets. The string has 55 visible characters and 6 line breaks, and each line break is actually 2 characters here (\r\n).
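
The character counts can be checked directly. The sketch below rebuilds the sample string with explicit \r\n line endings, which is an assumption inferred from the offsets 25 and 51 quoted in this answer; the variable names are illustrative.

```php
<?php
// The sample document, joined with Windows-style (\r\n) line endings.
// Assumption: 2-character line breaks, as implied by the offsets in the answer.
$lines = ['<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>'];
$html = implode("\r\n", $lines);

$line_breaks = substr_count($html, "\r\n");        // number of line breaks
$visible = strlen(str_replace("\r\n", '', $html)); // characters excluding line breaks

echo "total bytes: ", strlen($html), "\n";
echo "line breaks: ", $line_breaks, "\n";
echo "visible characters: ", $visible, "\n";
```

Functions such as strlen, strpos, and substr all work on byte offsets, so line-break characters count exactly like any other character.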

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51), but this position counts the line-break characters as well.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 31 (line breaks included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 51 - 25 + 6 = 32 (the number of characters to read), but there are only 16 printable characters of text between <body> and </body>, plus 4 non-printable characters (the line break after <body> and the line break before </body>). That is the problem: you have to group the calculation with parentheses, like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
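
The arithmetic above can be reproduced with a short sketch. It assumes the sample document uses \r\n line endings, which is what yields the offsets 25 and 51; the variable names are illustrative and not from the original code.

```php
<?php
// Sample document with \r\n line endings (assumed, to match offsets 25 and 51).
$html = implode("\r\n", ['<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>']);

$start = strpos($html, '<body>');   // offset of the opening tag
$tag_len = strlen('<body>');        // length of the opening tag
$end = strpos($html, '</body>');    // offset of the closing tag

// Ungrouped: the tag length is added back instead of subtracted.
$wrong_length = $end - $start + $tag_len;
// Grouped: subtract the offset where the body content starts.
$right_length = $end - ($start + $tag_len);

var_dump(substr($html, $start + $tag_len, $wrong_length)); // runs past </body>
var_dump(substr($html, $start + $tag_len, $right_length)); // only the body content
```

The ungrouped length is larger than the grouped one by twice the tag length, which is why the first substr call returns text beyond the closing tag.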

:) Hope this helps you understand why grouping the calculation matters. (Sorry for misleading you about newlines; that point only applies to the regex example I gave above.)
