Extract img tags from an HTML code string with the PHP preg_match_all function

Question

I have some HTML code and want to extract the img src attributes from it. The HTML string contains img tags like this:

<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">

I've tried to do this with the following PHP code:

$description = wpautop($this->data->description);
// Strip anything in square brackets (e.g. shortcodes).
$description = preg_replace("/\[[^\]]+\]/", '', $description);
if (preg_match_all("<img src=(.*?)>", $description, $match)) {
    var_dump($match);
}

The result is NULL.

Can you help me?

Recommended answer

Do not use regex on HTML!

Use a DOM parser instead, as it is much less hassle.

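$html = file_get_contents("your_file.html");

$dom = new \DOMDocument();
// Parser options such as preserveWhiteSpace need to be set before loading.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

// Collect the src attribute of every <img> element.
$images = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    $images[] = $image->getAttribute('src');
}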
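
Since the question starts from an HTML string rather than a file, the same approach can be sketched with the string passed straight to loadHTML. The sample markup (in particular the second image) and the variable names are illustrative assumptions, and libxml_use_internal_errors(true) is used here only to keep libxml's complaints about imperfect markup from being emitted as PHP warnings.

```php
<?php
// Sample input; the second image and the surrounding markup are made up for illustration.
$html = '<p>Intro <img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>'
      . '<p><img class="logo" alt="x" src="/images/example.png"></p>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    // getAttribute() returns '' when the attribute is missing, so skip those.
    $src = $image->getAttribute('src');
    if ($src !== '') {
        $images[] = $src;
    }
}

print_r($images);
```

Note that the attribute order and extra attributes on the second tag do not matter to the parser, which is the main practical advantage over a hand-written pattern.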

You are using the wpautop function to clean up the description. According to the documentation, its first argument is "the text to be formatted". So first make sure that it actually preserves the image tags in that text.

Assuming the tags are preserved, the next thing to look at is the regex itself, and I see that it's matching too little.

You are matching .*? inside the capturing group. The .* matches any character, zero or more times, and the trailing ? makes the match lazy, meaning it takes as few characters as the rest of the pattern allows.
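
The difference between greedy and lazy matching is easiest to see side by side. The following is a small sketch on a made-up input string with two tags; the file names and variable names are illustrative only.

```php
<?php
// Illustrative input with two tags on one line.
$html = '<img src="a.png"><img src="b.png">';

// Greedy: .* runs as far as it can, so the match ends at the LAST "> in the string.
preg_match('/<img src="(.*)">/', $html, $greedy);

// Lazy: .*? stops at the FIRST "> that lets the rest of the pattern match.
preg_match('/<img src="(.*?)">/', $html, $lazy);

echo $greedy[1], "\n"; // a.png"><img src="b.png
echo $lazy[1], "\n";   // a.png
```

The lazy group only stops where it does because something follows it in the pattern; with nothing after it, a lazy .*? is satisfied by zero characters.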

In my var_dump output for $match I can see that it found a match.

array (size=2)
  0 => 
    array (size=1)
      0 => string 'img src=' (length=8)
  1 => 
    array (size=1)
      0 => string '' (length=0)

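However, the first capture group has length 0. This is not a PHP bug. PHP treats the first character of a preg pattern as its delimiter, and it accepts bracket-style pairs such as < and > as delimiters. So "<img src=(.*?)>" is read as the pattern img src=(.*?) enclosed in < and >, and the > is not part of the regex at all. That is also why the full match above is 'img src=' without the leading <. With nothing left in the pattern after the lazy group, .*? is satisfied by zero characters.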

If you change the capturing group to .+?, the first group will contain a single " character, because + means "one or more" characters.
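
The empty group can be reproduced in isolation. One detail worth knowing here is that PHP treats the first character of a preg pattern as its delimiter and accepts bracket-style pairs such as < and >, so the first two calls below should produce identical results. The input is the tag from the question; the variable names are illustrative.

```php
<?php
$html = '<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">';

// "<" and ">" act as delimiters here, so the actual pattern is: img src=(.*?)
preg_match_all("<img src=(.*?)>", $html, $a);

// The same pattern written with "/" delimiters.
preg_match_all('/img src=(.*?)/', $html, $b);

var_dump($a === $b);  // true: both calls return the same matches
var_dump($a[0][0]);   // full match: 'img src='
var_dump($a[1][0]);   // group 1: empty string

// With .+? the lazy group must take at least one character: the opening quote.
preg_match_all("<img src=(.+?)>", $html, $c);
var_dump($c[1][0]);   // group 1: a single " character
```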

A solution is to change the pattern so that it includes the quotation marks; the closing quote then gives the lazy group something to stop at.

if (preg_match_all("<img src=\"(.*?)\">", $description, $match)) {

This matches the desired image link:

http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png
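
For comparison, the same extraction can be written with explicit delimiters so that < and > are unambiguously part of the pattern. This is only a sketch: the pattern is an assumption about what the markup looks like (a double-quoted src, possibly with other attributes around it), the second tag in the sample is made up, and valid HTML with single-quoted or unquoted attributes would still be missed.

```php
<?php
$description = '<p><img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>'
             . '<p><img class="alignleft" src="/images/example.png" alt=""></p>'; // second tag is made up

// "~" is the delimiter; [^>]* allows other attributes around src; the i flag ignores case.
$pattern = '~<img\b[^>]*\bsrc="([^"]*)"[^>]*>~i';

$sources = [];
if (preg_match_all($pattern, $description, $match)) {
    $sources = $match[1]; // group 1 holds the src values
}

print_r($sources);
```

Using [^"]* instead of .*? inside the quotes avoids relying on lazy matching at all, since the group simply cannot run past the closing quote.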

I would recommend the DOMDocument approach, as that code is likely to be more stable and extensible. If you want to learn regex, parsing HTML might not be the best place to start.

All of this code was tested with PHP 5.4; it might behave differently in newer versions!
