How to extract img src, title and alt from HTML using PHP?

Views: 85 | Posted: 2018/6/13 9:25:03 | Tags: php, html, regex, html-parsing, html-content-extraction


Question

I would like to create a page where all images which reside on my website are listed with title and alternative representation.

I have already written a small program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
I guess this should be done with a regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).
Answer
EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. It is better to use an HTML parser.
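The answer recommends an HTML parser without naming one. As a minimal sketch of that route, assuming PHP's bundled DOM extension is available, the snippet below collects the three attributes with DOMDocument. The function name extract_img_attributes is hypothetical, and the sample markup is the tag from the question.

```php
<?php
// Hypothetical helper (not from the original answer): collects src, title
// and alt from every <img> element using PHP's bundled DOM extension.
function extract_img_attributes(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed; collect parser warnings
    // internally instead of emitting them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $images[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $images;
}

$html = '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></p>';
$images = extract_img_attributes($html);
```

Because the parser handles attribute order, quoting style and letter case itself, none of the special cases discussed for the regexp version need separate treatment here.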

Solution with regexp

In that case it's better to split the process into two parts:

get all the img tags

extract their metadata

I will assume your document is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code:
/* preg_match_all() matches the regexp against the whole $html string
   and stores every match as an array in $result.
   The "i" option makes the match case insensitive. */
preg_match_all('/<img[^>]+>/i', $html, $result);

print_r($result);

Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [...]
        )
)
Then we get all the img tag attributes with a loop:

$img = array();
/* $result[0] holds the full-pattern matches, i.e. the img tags themselves */
foreach ($result[0] as $img_tag) {
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}

print_r($img);

Array
(
    [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )
        )
    [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )
        )
    [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )
        )
    [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => alt="gravatar image"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => "gravatar image"
                )
        )
    [..]
)
Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading/saving from a text file.
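The answer does not show what such a home-made cache looks like, so here is a rough sketch of one possible shape, using ob_start() and a plain file. The helper name cached_output, the $maxAge parameter and the cache file location are all assumptions made for illustration.

```php
<?php
// Hypothetical helper (names are illustrative): returns the cached page if
// the cache file is fresh, otherwise renders it, stores it and returns it.
function cached_output(string $cacheFile, int $maxAge, callable $render): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    ob_start();               // capture everything $render() echoes
    $render();
    $output = ob_get_clean(); // stop capturing and fetch the buffer
    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Demo with a throwaway cache file in the system temp directory.
$cacheFile = sys_get_temp_dir() . '/image-list-' . uniqid() . '.html';
$calls = 0;
$render = function () use (&$calls) {
    $calls++; // counts how often the expensive work actually runs
    echo '<ul><li>/image/fluffybunny.jpg</li></ul>';
};
$first  = cached_output($cacheFile, 3600, $render);
$second = cached_output($cacheFile, 3600, $render); // served from the file
unlink($cacheFile);
```

In a real page the caller would echo the returned string, and the regexp extraction would live inside the $render callback so that it only runs when the cache is stale.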

How does this stuff work?

First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.

The regexps:
<img[^>]+>
We apply it to the whole HTML of each web page. It can be read as: every string that starts with "<img", contains one or more non-">" characters and ends with a ">".
(alt|title|src)=("[^"]*")
We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a ' " ', a bunch of characters that are not ' " ', and a closing ' " '. The parentheses isolate the sub-strings between them.
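With that second pattern the captured values still carry their surrounding quotes. As an illustrative variation (not the author's code), the sketch below moves the quotes outside the capture group and uses PREG_SET_ORDER to build one src/title/alt row per tag. The function name img_attributes_with_regex is hypothetical.

```php
<?php
// Hypothetical helper: returns one ['src' => ..., 'title' => ..., 'alt' => ...]
// row per <img> tag, with null for attributes the tag does not carry.
function img_attributes_with_regex(string $html): array
{
    $rows = [];
    // Step 1: grab every img tag.
    preg_match_all('/<img[^>]+>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        $row = ['src' => null, 'title' => null, 'alt' => null];
        // Step 2: the quotes sit outside the second group, so only the
        // bare value is captured. PREG_SET_ORDER yields one entry per
        // attribute: [full match, name, value].
        preg_match_all('/(alt|title|src)="([^"]*)"/i', $tag, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $row[strtolower($pair[1])] = $pair[2];
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = img_attributes_with_regex(
    '<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />'
);
```

Like the original pattern, this only handles double-quoted values, and it would also pick up an attribute whose name merely ends in "src", "alt" or "title".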

Finally, whenever you work with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " with '.

If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] in place of [^"].
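One way to handle mixed quoting, offered as an interpretation and not necessarily what the author had in mind, is to capture the opening quote and require the same character at the end with a backreference. The pattern and sample tag below are illustrative.

```php
<?php
// Group 2 captures the opening quote (" or '); \2 then requires the same
// character to close the value. Group 3 is the value without its quotes.
$pattern = '/(alt|title|src)=("|\')(.*?)\2/i';

// Sample tag mixing both quote styles, including a quote of the other
// kind inside a value.
$tag = '<img src=\'/image/fluffybunny.jpg\' title="Harvey\'s bunny" alt=\'a "cute" bunny\' />';

preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[3];
}
```

The lazy quantifier .*? stops at the first closing quote that matches the opening one, so a quote of the other kind inside the value is left alone.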
