Regex to pull all attributes out of all meta tags

Question

I'm trying to pull meta tags out of an HTML page in order to compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to verify that the title, meta tags (description, Open Graph, etc.), h1s, our analytics (Omniture), and our ad tags (DoubleClick) are all the same.

My problem is that get_meta_tags (http://php.net/manual/en/function.get-meta-tags.php) only works for tags that have a name= attribute; the same goes for the solution by "mariano at cricava dot com".

I don't want to restrict it to certain attributes. I could assume that all our meta tags have either a name=, property= or http-equiv= attribute and change the regex accordingly, but I can't be entirely sure: it's a massive website and any random crap could be in the tags (which is exactly what this tool is meant to check), so I'd like to keep it as dynamic as possible.

I have:

$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);

But the subpatterns overwrite each other, so this only pulls out the last attribute-name=attribute-value pair:

Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )

    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )

    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )
...

I need all the attributes of all the meta tags. I could do this in two steps, pulling out the contents of <meta ([^>]*)> and then running a second regular expression on the results, but that seems like it should be unnecessary given the power of regex?

Recommended answer

But back to the original question: forgetting that it's HTML for now, is there no way to have repeated subpatterns all returned by preg_match_all, rather than just the last match?

Not possible with preg_*/PCRE (nor with any other regex flavor that I know of, although in Perl you could use a (?{ push @list, $^N }) hack).
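
Since a repeated capture group keeps only its last iteration, the two-step approach described in the question remains a practical option. The following is only a minimal sketch of that idea, not code from the original thread: the function name extract_meta_attributes is hypothetical, it assumes every attribute value is quoted, and it assumes no attribute value contains a ">" character.

```php
<?php
// Hypothetical helper (not from the original post): returns one associative
// array of attribute => value for each <meta> tag found in $html.
// Assumptions: attribute values are quoted and never contain ">".
function extract_meta_attributes(string $html): array
{
    $tags = [];

    // Step 1: capture the raw attribute string of every <meta ...> tag.
    preg_match_all('#<meta\b([^>]*)>#i', $html, $metaMatches);

    foreach ($metaMatches[1] as $attributeString) {
        // Step 2: capture each name="value" or name='value' pair.
        preg_match_all(
            '#([^\s=/"\']+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')#',
            $attributeString,
            $pairs,
            PREG_SET_ORDER
        );

        $attributes = [];
        foreach ($pairs as $pair) {
            // Group 2 holds a double-quoted value, group 3 a single-quoted one.
            $value = ($pair[2] ?? '') !== '' ? $pair[2] : ($pair[3] ?? '');
            // Attribute names are lowercased so live and dev pages compare equal.
            $attributes[strtolower($pair[1])] = $value;
        }
        $tags[] = $attributes;
    }

    return $tags;
}
```

Calling extract_meta_attributes($page) on the fetched page gives one array per meta tag, so the live and dev results can be compared directly without assuming which attributes are present.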
