Extract form fields using RegEx


Question


I'm looking for a way to get all the form inputs and respective values from a page given a specific URL and form name.

function GetForm($url, $name)
{
    return array
    (
        'field_name_1' => 'value_1',
        'field_name_2' => 'value_2',
        'select_field_name' => array('option_1', 'option_2', 'option_3'),
    );
}

GetForm('http://www.google.com/', 'f');

Can anyone provide me with the necessary regular expressions to accomplish this?

EDIT: I understand that querying the DOM would be far more reliable; however, what I'm looking for is a website-agnostic solution that allows me to get all the fields of a given form. I don't believe this is possible with the DOM without knowing the document nodes first. Am I wrong?

I don't need a bulletproof solution, just something that works on standard web pages. For the FORM tag, I've come up with the following RegEx:

'~<form.*?name=[\'"]?' . $name . '[\'"]?.*?>(.+?)</form>~is'

I believe that doing something similar for input fields won't be difficult; what I find most challenging is the RegEx for the select and option fields.

Solution

Using regex to parse HTML is probably not the best way to go.

You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).

You might also want to take a look at Zend_Dom and Zend_Dom_Query, btw, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance, and work quite well ;-)

It may seem harder at first... But, considering the mess some HTML pages are, it is probably a much wiser idea...


EDIT after the comment and the edit of the OP

Here are a couple of thoughts, to begin with something "simple", an input tag:

  • it can spread across several lines
  • it can have many attributes
  • considering that only name and value are of interest to you, you have to deal with the fact that those two can appear in any order
  • attribute values can be wrapped in double quotes, single quotes, or nothing at all
  • tags / attributes can be lower-case or upper-case
  • tags don't always have to be closed

Well, some of those points are not valid HTML, but they still work in the most common web browsers, so they have to be taken into account...

With those points alone, I wouldn't like to be the one writing the regex ^^
But I suppose there might be other difficulties I didn't think about.
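To make the list above concrete, here is a minimal sketch of how a DOM parser copes with those irregularities. The markup is a made-up sample (not taken from any real page), and libxml_use_internal_errors() is used here only to keep parser warnings quiet:

```php
<?php
// Hypothetical sample markup: each <input> exercises one of the
// irregularities listed above (several lines, mixed case, attribute
// order, quoting style, unclosed tags).
$html = <<<'HTML'
<form name="f">
<INPUT
    TYPE="text"
    VALUE='first value'
    NAME="a">
<input name=b value=second>
<input value="third" type="hidden" name='c'>
</form>
HTML;

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// The HTML parser normalises tag and attribute names to lower case,
// so a single lookup covers <INPUT NAME=...> and <input name=...>.
$fields = array();
foreach ($dom->getElementsByTagName('input') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}
print_r($fields);
```

With this sample, $fields should end up mapping a, b and c to "first value", "second" and "third", without a single line of the code caring about quoting, case or attribute order.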


On the other hand, you have DOM and XPath... To get the value of an input with name="q" (the example is this page), it's a matter of something like this:

$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);

    $nodeList = $xpath->query('//input[@name="q"]');
    if ($nodeList->length > 0) {
        for ($i=0 ; $i<$nodeList->length ; $i++) {
            $node = $nodeList->item($i);
            var_dump($node->getAttribute('value'));
        }
    }

} else {
    // too bad...
}

What matters here? The XPath query, and only that... And is there anything static/constant in it?
Well, I say I want all <input> elements that have a name attribute equal to "q".
And it just works. I'm getting this result:

string 'test' (length=4)
string 'test' (length=4)

(I checked: there are two input name="q" on the page ^^ )

Do I know the structure of the page? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)

And that's what we get ;-)
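The same idea can probably be stretched to what the question asked for: all the inputs of a form identified only by its name. The following is a sketch under assumptions, not a finished solution: getFormInputs() is a hypothetical helper name, the HTML is an inline sample instead of a fetched page, and form names containing a double quote are simply rejected because XPath 1.0 string literals cannot escape them.

```php
<?php
/**
 * Hypothetical helper: return name => value for every named <input>
 * located anywhere inside the form whose name attribute is $formName.
 */
function getFormInputs($html, $formName)
{
    // Sketch-level guard: XPath 1.0 cannot escape a quote inside a literal.
    if (strpos($formName, '"') !== false) {
        return array();
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about invalid HTML quiet
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return array();
    }

    $xpath = new DOMXPath($dom);
    // '//input' after the form step means "any depth below that form",
    // so no knowledge of the markup between form and input is needed.
    $query = '//form[@name="' . $formName . '"]//input[@name]';

    $fields = array();
    foreach ($xpath->query($query) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up sample page with two forms; only the one named "f" is wanted.
$html = '<form name="other"><input name="x" value="1"></form>'
      . '<form name="f"><p><input name="q" value="test">'
      . '<input type="submit" name="btn" value="Go"></p></form>';

print_r(getFormInputs($html, 'f'));
```

The only page-specific piece is still the form name; the inputs named q and btn are found even though they sit inside a paragraph the query never mentions.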


EDIT 2: and a bit of fun with select and options:

Well, just for fun, here's what I came up with for select and option:

$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);

    $nodeListSelects = $xpath->query('//select');
    if ($nodeListSelects->length > 0) {
        for ($i=0 ; $i<$nodeListSelects->length ; $i++) {
            $nodeSelect = $nodeListSelects->item($i);
            $name = $nodeSelect->getAttribute('name');
            $nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect);  // We want options that are inside the current select
            if ($nodeListOptions->length > 0) {
                for ($j=0 ; $j<$nodeListOptions->length ; $j++) {
                    $nodeOption = $nodeListOptions->item($j);
                    $value = $nodeOption->getAttribute('value');
                    var_dump("name='$name' => value='$value'");
                }
            }
        }
    }
} else {
    // too bad...
}

And I get this output:

string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)

Which is what I expected.


Some explanations?

Well, first of all, I get all the select tags of the page, and keep their names in memory.
Then, for each one of those, I get the selected option tags that are its direct children (there's always only one, btw).
And here, I have the value.

A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still wouldn't have the courage (madness?) to start thinking about some kind of mutant regex that would be able to do that :-D

Oh, and, as a sidenote: I still have no idea what the structure of the HTML document looks like. I have not even taken a single look at its source ^^
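For the shape requested in the question (each select name mapped to an array of its options), a variation along these lines might do. Again this is only a sketch: getSelectOptions() is a hypothetical name, the markup is an invented sample, and falling back to the option text when there is no value attribute is my assumption about what is wanted.

```php
<?php
/**
 * Hypothetical helper: map each named <select> to the list of values
 * of all its options, selected or not.
 */
function getSelectOptions($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about invalid HTML quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = array();
    foreach ($xpath->query('//select[@name]') as $select) {
        $values = array();
        // './/option' is evaluated relative to the current <select> and
        // also reaches options nested inside an <optgroup>.
        foreach ($xpath->query('.//option', $select) as $option) {
            // Assumption: an option without a value attribute is
            // represented by its text content.
            $values[] = $option->hasAttribute('value')
                ? $option->getAttribute('value')
                : trim($option->textContent);
        }
        $result[$select->getAttribute('name')] = $values;
    }
    return $result;
}

// Made-up sample markup.
$html = '<form name="f"><select name="sl">'
      . '<option value="fr" selected>French</option>'
      . '<option value="en">English</option>'
      . '<optgroup label="Other"><option>de</option></optgroup>'
      . '</select></form>';

print_r(getSelectOptions($html));
```

For this sample the result should be a single entry, sl, holding fr, en and de. Restricting the inner query to './/option[@selected]' would bring it back to "selected options only", and it would match any selected attribute whatever its value.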


I hope this helps a bit more...
Who knows, maybe I'll convince you regex is not a good idea when it comes to parsing HTML... maybe? ;-)

Still: have fun!
