Is it possible to check if two groups are equal?

Question

If I have some HTML like this:

 <b>1<i>2</i>3</b>

and the following regex:

 \<[^\>\/]+\>(.*?)\<\/[^\>]+\>

then it will match:

 <b>1<i>2</i>

I want it to only match HTML where the start and end tags are the same. Is there a way to do this?

Thanks,

Recommended answer

Is there a way to do this?

Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.

Pretending for the nonce that HTML <i> and <b> tags are always denuded of attributes, and moreover, neither overlap nor nest, we have this simple solution:

#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
    say "$1: $2" while m{
          < ( [ib] ) >
          (
              (?:
                  (?!  < /? \1  > ) .
              ) *
          )
          </ \1  >
    }gsix;
}

When run, that produces this:

$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
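
Applied to the pattern from the question, the same backreference idea can be sketched roughly as follows. This is only an illustration, not part of the original answer: the variable names ($html, $loose, $name, $body) are made up for the example, and it still assumes attribute-free tags where a tag never nests inside another tag of the same name.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# Sample input taken from the question.
my $html = '<b>1<i>2</i>3</b>';

# The question's pattern: the end tag may differ from the start tag.
my ($loose) = $html =~ m{ ( < [^>/]+ > .*? </ [^>]+ > ) }sx;

# Same idea, but the end tag must repeat whatever group 1 captured.
my ($name, $body) = $html =~ m{ < ( [^>/]+ ) > (.*?) </ \1 > }sx;

say "loose:  $loose";
say "strict: <$name>$body</$name>";
```

On the sample input the loose pattern stops at </i>, while the backreferenced one runs on to the matching </b>.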

Named Captures

It would be better to use named captures, which leads to this equivalent solution:

#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
    say "$+{name}: $+{contents}" while m{      
          < (?<name> [ib] ) >
          (?<contents>
              (?:
                  (?!  < /? \k<name>  > ) .
              ) *
          )
          </ \k<name>  >
    }gsix;
}

Recursive Captures

Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it requires a recursive pattern to solve. Remember that the trivial pattern to parse nested parens recursively is simply:

( \( (?: [^()]++ | (?-1) )*+ \) )
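
To see that pattern in action before it gets folded into the tag matcher, here is a small sketch that runs it over a few strings. The test strings and the %found hash are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The recursive pattern from above: an open paren, then any mix of
# non-paren runs and nested balanced groups, then a close paren.
# (?-1) recurses into the most recently opened capture group,
# which here is the group that encloses it.
my $balanced = qr{ ( \( (?: [^()]++ | (?-1) )*+ \) ) }x;

my %found;    # hypothetical helper: input string => balanced groups found
for my $text ('f(a(b)c)', 'x(1)(2(3))', '(()') {
    my @groups = $text =~ /$balanced/g;
    $found{$text} = "@groups";
    say "$text => $found{$text}";
}
```

The last string is deliberately unbalanced: the match cannot start at the first paren, so only the inner () is reported.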

I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.

#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration 
while (my $line = <>) {
    my @input = ( $line );
    while (@input) { 
        my $cur = shift @input;
        while ($cur =~ m{      
                          < (?<name> [ib] ) >
                          (?<contents>
                              (?:
                                    [^<]++
                                  | (?0)
                                  | (?!  </ \k<name>  > )
                                     .
                              ) *+
                          )
                          </ \k<name>  >
               }gsix)
        {
            say "$+{name}: $+{contents}";
            push @input, $+{contents};
        } 
    }
}

When demoed, that produces this:

$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar

That’s still fairly simple, so if it works on your data, go for it.

However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.

As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.

For example, this knows the attributes germane to the <i> (or <b>) tag. Here we define regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs, but now for regexes:

(?(DEFINE)   # begin regex subroutine defs for grammatical regex

    (?<i_tag_end> < / i > )

    (?<i_tag_start> < i (?&attributes) > )

    (?<attributes> (?: \s* (?&one_attribute) ) *)

    (?<one_attribute>
        \b
        (?&legal_attribute)
        \s* = \s* 
        (?:
            (?&quoted_value)
          | (?&unquoted_value)
        )
    )

    (?<legal_attribute> 
          (?&standard_attribute) 
        | (?&event_attribute)
    )

    (?<standard_attribute>
          class
        | dir
        | id
        | lang
        | style
        | title
        | xml:lang
    )

    # NB: The white space in string literals 
    #     below DOES NOT COUNT!   It's just 
    #     there for legibility.

    (?<event_attribute>
          on click
        | on dbl   click
        | on mouse down
        | on mouse move
        | on mouse out
        | on mouse over
        | on mouse up
        | on key   down
        | on key   press
        | on key   up
    )

    (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
    (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
    (?<equals>          (?&might_white)  = (?&might_white)    )
    (?<value>           (?&quoted_value) | (?&unquoted_value) )
    (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
    (?<unquoted_value>  [\w\-] *                              )
    (?<might_white>     \s *                                  )
    (?<quoted_value>
        (?<quote>   ["']      )
        (?: (?! \k<quote> ) . ) *
        \k<quote> 
    )
    (?<start_tag>  < (?&might_white) )
    (?<end_tag>          
        (?&might_white)
        (?: (?&html_end_tag) 
          | (?&xhtml_end_tag) 
         )
    )
    (?<html_end_tag>       >  )
    (?<xhtml_end_tag>    / >  )

)
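
As a rough sketch of how such definitions get called, here is a deliberately cut-down grammar that accepts an <i> start tag with optional attributes. It is an assumption-laden simplification, not the answer's full grammar: the subroutine names mirror the block above, but their bodies are shortened (any attribute name is accepted, with no whitelist), and $i_start and %verdict are hypothetical names.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The pattern proper comes first; the DEFINE block only declares
# named subpatterns and never matches anything by itself.
my $i_start = qr{
    < i (?&attributes) \s* >

    (?(DEFINE)
        (?<attributes>    (?: \s+ (?&one_attribute) )*      )
        (?<one_attribute> (?&name) \s* = \s* (?&value)      )
        (?<name>          [A-Za-z] [\w:\-]*                 )
        (?<value>         " [^"]* " | ' [^']* ' | [\w\-]+   )
    )
}x;

my %verdict;    # hypothetical helper: candidate tag => result
for my $tag ('<i>', '<i class="x" id=y>', '<i class=>', '<img src="a">') {
    $verdict{$tag} = $tag =~ /\A $i_start \z/x ? 'match' : 'no match';
    say "$tag: $verdict{$tag}";
}
```

The first two candidates match; the attribute with a missing value and the <img> tag do not.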

Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.

However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.

I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:

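  1. You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
  2. You might find concepts such as recursive and grammatical patterns too complicated to understand easily.
  3. You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.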

Any one or more of those might well apply. In which case, don’t do it this way.

For simple canned examples, this route is easy. The more robustly you want this to work on things you’ve never seen before, the harder this route becomes.

Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or, even worse, JavaScript. Those are barely any better than the Unix grep program, and in some ways are even worse. No, you need a modern pattern matching engine such as that found in Perl or PHP to even start down this road.

But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.

Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.

This forum isn’t really in the right format for explaining all these things about modern pattern matching. There are books, though, that do so reasonably well.