Regular expression pattern not matching anywhere in string

Problem description

I am trying to match <input> fields of type "hidden" using this pattern:

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

This is sample form data:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

But I am not sure that the type, name, and value attributes will always appear in the same order. If the type attribute comes last, the match will fail because in my pattern it’s at the start.
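
To make the ordering problem concrete, here is a minimal Perl sketch. The two sample tags and the helper name match_fixed_order are made up for illustration; only the pattern itself comes from above.

```perl
use strict;
use warnings;

# The fixed-order pattern from the question.
my $fixed_order = qr{<input type="hidden" name="([^"]*?)" value="([^"]*?)" />};

# Hypothetical helper: returns (name, value) on a match, an empty list otherwise.
sub match_fixed_order {
    my ($tag) = @_;
    return $tag =~ $fixed_order;
}

# Two made-up tags carrying the same attributes in a different order.
my $type_first = '<input type="hidden" name="SaveRequired" value="False" />';
my $type_last  = '<input name="SaveRequired" value="False" type="hidden" />';

for my $tag ($type_first, $type_last) {
    my @captures = match_fixed_order($tag);
    if (@captures) {
        print "matched: name=$captures[0] value=$captures[1]\n";
    }
    else {
        print "no match: $tag\n";
    }
}
```

The first tag matches; the second does not, even though it describes the same field.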

Question:
How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?

P.S.: By the way, I am using the Adobe Air based RegEx Desktop Tool for testing regular expressions.

Solution

Contrary to all the answers here, regex is a perfectly valid solution for what you're trying to do. This is because you are NOT trying to match balanced tags; THAT would be impossible with regex! You are only matching what's inside one tag, and that's perfectly regular.

Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values have a > character in them, so it's not perfect, but it should suffice for sane inputs.

Here's some Perl (pseudo)code to show you what I mean:

use strict;
use warnings;

my $html = readLargeInputFile();

# The m{...} delimiters allow a literal / inside the pattern and its comments.
my @input_tags = $html =~ m{
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        />                          # ...except for the />, which is grabbed here
    )}xg;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

my @hidden_fields;
foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Collect the result; process @hidden_fields however you need afterwards.
  push @hidden_fields, $hash_ref;
}

# Stand-in for the pseudocode helper: slurps all of STDIN into one string.
sub readLargeInputFile { local $/; return scalar <STDIN>; }
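
The caveat about a > character inside an attribute value can be seen with a small sketch. It uses the same tag-matching pattern as above on two made-up tags (the variable names are illustrative assumptions), one of which has a > in its value.

```perl
use strict;
use warnings;

# Same tag-matching pattern as above, on one line.
my $hidden_tag = qr{(<input(?=[^>]*?type="hidden")[^>]+/>)};

# Two made-up tags; the second has a > inside its value attribute.
my $plain   = '<input type="hidden" name="a" value="1" />';
my $with_gt = '<input type="hidden" name="b" value="x>y" />';

# In list context, /g returns every capture of group 1.
my @found = ($plain . $with_gt) =~ /$hidden_tag/g;

print scalar(@found), " tag(s) found\n";
print "$_\n" for @found;
```

Only the first tag is found. The second is skipped without any error, because [^>]+ stops at the > inside the value and no /> follows at that point. That silent skip is the behaviour to keep in mind if the input is not under your control.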

The basic principle here is, don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.
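
The same two-step idea can be taken a little further, as a rough sketch rather than a definitive implementation: match each hidden tag first, then collect every attribute of that tag into a hash, so the attribute order no longer matters. The helper name hidden_inputs and the sample HTML are assumptions made for illustration, and the sketch only handles double-quoted attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash reference per hidden <input> tag,
# holding every double-quoted attribute of that tag, whatever their order.
sub hidden_inputs {
    my ($html) = @_;
    my @result;

    # Step 1: match the context, i.e. each hidden <input ... /> tag.
    while ($html =~ m{(<input\b(?=[^>]*?\stype="hidden")[^>]+/>)}g) {
        my $tag = $1;

        # Step 2: submatch on that tag alone, collecting attribute/value pairs.
        my %attr = $tag =~ m{([\w:-]+)\s*=\s*"([^"]*)"}g;
        push @result, \%attr;
    }
    return @result;
}

# Made-up sample: attributes in varying order, plus one non-hidden input.
my $html = '<input name="a" value="1" type="hidden" />'
         . '<input type="text" name="skipped" value="x" />'
         . '<input value="" type="hidden" name="b" />';

for my $input (hidden_inputs($html)) {
    print "$input->{name} = '$input->{value}'\n";
}
```

Because the second step runs against a single tag, each submatch stays simple and does not need to know where in the tag the attribute appears.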

EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.
