Trouble with parsing table data in Perl


Problem description

I have a long HTML document with a repeating pattern that goes like this:

<td class="MODULE_PRODUCTS_CELL " align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a><\br>
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>

In the above html I want to extract:

  1. collections.php?prod_id=50
  2. files/products_categories50_t.txt
  3. Buffer

I have tried this code to begin with,

#!/usr/local/bin/perl

use strict;
use warnings;
my $filename =  'sr.txt';

open(FILENAME,$filename);
my @str = <FILENAME>;
chomp(@str);
#print "@str";

foreach my  $str(@str){    
     if ($str =~/<td class(.*)<a href(.*?)><\/td>/) {
         print "*****$2\n";
     }    
}

This code is only a first attempt. However, it finds only the last occurrence and not each occurrence. Why?

Solution

SUMMARY

Using patterns on small, limited pieces of reasonably well-defined HTML is quick and easy. But using them on an entire document containing fully general, open-ended HTML with unforeseeable quirks is, while theoretically possible, in practice much too hard compared with using someone else’s parser that has already been written for that express purpose. See also this answer for a more general discussion on using patterns on XML or HTML.

Naïve Regex Solution

You’ve asked for a regex solution, so I will provide you such.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

while (m{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "IMG SRC=$1\n";
}

while (m{ < \s* a [^>]* href \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "A HREF=$1\n";
}

while (m{ < \s* strong [^>]* > (.*?) < \s* / \s* strong \s* > }gsix) {
    print "STRONG=$1\n";
}

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG SRC=files/products_categories50_t.txt
A HREF=/collections.php?prod_id=50
A HREF=/collections.php?prod_id=50
STRONG=Buffer

If you are quite certain that works for the particular specimen of HTML that you wish it to, then by all means use it. Notice several things that I do which you didn’t. One of them is not dealing with the HTML a line at a time. That virtually never works.
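
To illustrate the point about not processing HTML a line at a time, here is a minimal sketch. The sample fragment (with the `<a>` tag split across two lines) and the helper names `hrefs_by_line` and `hrefs_slurped` are assumptions made up for this illustration; the pattern is the same href pattern used in the program above.

```perl
use strict;
use warnings;

# Hypothetical sample: the <a ...> start tag is split across two lines.
my $html = <<'END_HTML';
<td align="center">
<a
   href="/collections.php?prod_id=50"><strong>Buffer</strong></a>
</td>
END_HTML

# The href pattern from the naive solution, precompiled.
my $href_rx = qr{ < \s* a [^>]* href \s* = \s* ['"]? ([^<>'"]+) }six;

# Apply the pattern to each line separately, as the question's code does.
sub hrefs_by_line {
    my ($text) = @_;
    my @found;
    for my $line (split /\n/, $text) {
        push @found, $line =~ /$href_rx/g;
    }
    return @found;
}

# Apply the pattern to the whole text at once.
sub hrefs_slurped {
    my ($text) = @_;
    my @found = $text =~ /$href_rx/g;
    return @found;
}

my @by_line = hrefs_by_line($html);
my @slurped = hrefs_slurped($html);

printf "line at a time: %d match(es)\n", scalar @by_line;
printf "whole document: %d match(es): %s\n", scalar @slurped, "@slurped";
```

Because no single line contains both `<a` and `href`, the per-line loop cannot see the tag, while the slurped version can.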

However, this sort of solution works only on extremely limited forms of valid HTML. You can only use it when you can guarantee that the HTML you’re working with really looks like what you expect it to.
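
As one small example of how the naive pattern can be misled, the sketch below feeds it a fragment in which one image is commented out. The input and the helper names `img_sources` and `strip_comments` are hypothetical; the comment-stripping substitution is the same idea the longer program further down applies before matching.

```perl
use strict;
use warnings;

# Hypothetical input: one commented-out image followed by one live image.
my $html = <<'END_HTML';
<!-- <img src="files/old_banner.txt" alt="retired"> -->
<img src="files/products_categories50_t.txt" alt="products" />
END_HTML

# The img pattern from the naive solution, precompiled.
my $img_rx = qr{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }six;

# Return every src value the naive pattern finds.
sub img_sources {
    my ($text) = @_;
    my @src = $text =~ /$img_rx/g;
    return @src;
}

# Remove HTML comments before matching.
sub strip_comments {
    my ($text) = @_;
    $text =~ s{ <!-- .*? --> }{}gsx;
    return $text;
}

print "raw:      $_\n" for img_sources($html);
print "stripped: $_\n" for img_sources(strip_comments($html));
```

On the raw text the pattern also reports the commented-out image; after stripping comments only the live one remains. Comments are just one of many such quirks, which is why the guarantee about the input matters.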

The problem is that it quite often does not look all neat and tidy. For these situations, you are strongly advised to use an HTML parsing class. However, no one seems to have shown you the code to do that. That’s not very helpful.

Wizard-Level Regex Solution

And I’m going to be one of them myself. Because I am going to show you a more general solution for approaching what I believe your task to be, but unlike anyone else who ever posts on Stack Overflow, I’m going to use regexes to do it, just to show you that it can be done, but that you do not wish to do it this way:

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

our(
    $RX_SUBS,
    $tag_template_rx,
    $script_tag_rx,
    $style_tag_rx,
    $strong_tag_rx,
    $a_tag_rx,
    $img_tag_rx,
);

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ $style_tag_rx  .*?  < (?&WS) / (?&WS) style  (?&WS) > }{}gsix; 
s{ $script_tag_rx .*?  < (?&WS) / (?&WS) script (?&WS) > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

while (/$img_tag_rx/g) {
    my $tag = $+{TAG};
    printf "IMG tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b src (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tSRC is $value\n";
    } 

} 

while (/$a_tag_rx/g) {
    my $tag = $+{TAG};
    printf "A tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b href (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tHREF is $value\n";
    } 
} 

while (m{
            $strong_tag_rx  (?&WS) 
            (?<BODY> .*? )  (?&WS) 
            < (?&WS) / (?&WS) strong (?&WS) > 
        }gsix) 
{
    my ($tag, $body) = @+{ qw< TAG BODY > };
    printf "STRONG tag at %d: %s\n\tBODY=%s\n", 
            pos(), $+{TAG}, $+{BODY};
} 

exit;

sub dequote { 
    my $string = shift();
    $string =~ s{
        ^
        (?<quote>   ["']      )
        (?<BODY> 
            (?: (?! \k<quote> ) . ) *
        )
        \k<quote> 
        $
    }{$+{BODY}}gsx;
    return $string;
}

sub load_patterns { 

    $RX_SUBS = qr{ (?(DEFINE)

        (?<any_attribute> 
            \b \w+
            (?&WS) = (?&WS) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<WS>     \s *   )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

      ) # end DEFINE

    }six;

    my $_TAG_SUBS = $RX_SUBS . q{ (?(DEFINE)

        (?<attributes>
            (?: 
                (?&WS) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            (?= (?&legal_attribute) )
            (?&any_attribute) 
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<tag>
            (?&start_tag)
            (?&WS) 
            (?&attributes) 
            (?&WS) 
            (?&end_tag)
        )

      ) # end DEFINE

    };  # this is a q tag, not a qr

    $tag_template_rx = qr{ 

            $_TAG_SUBS

        (?<TAG> (?&XXX_tag) )

        (?(DEFINE)
            (?<XXX_tag>     (?&tag)             )
            (?<start_tag>  < (?&WS) XXX \b      )
            (?<required_attribute>      (*FAIL) )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE
    }six;

    $script_tag_rx = qr{   

            $_TAG_SUBS

        (?<TAG> (?&script_tag) )
        (?(DEFINE)
            (?<script_tag>  (?&tag)                )
            (?<start_tag>  < (?&WS) script \b      )
            (?<required_attribute>      type )
            (?<permitted_attribute>             
                charset     
              | defer
              | src
              | xml:space
            )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end DEFINE
    }six;

    $style_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&style_tag) )

        (?(DEFINE)

            (?<style_tag>  (?&tag)  )

            (?<start_tag>  < (?&WS) style \b       )

            (?<required_attribute>      type    )
            (?<permitted_attribute>     media   )

            (?<standard_attribute>
                dir
              | lang
              | title
              | xml:lang
            )

            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        )  # end define

    }six;

    $strong_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&strong_tag) )

        (?(DEFINE)

            (?<strong_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                strong 
                \b       
            )

            (?<standard_attribute>
                class       
              | dir 
              | ltr 
              | id  
              | lang        
              | style       
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on click
              | on dbl click
              | on mouse down
              | on mouse move
              | on mouse out
              | on mouse over
              | on mouse up
              | on key down
              | on key press
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<optional_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE

    }six; 

    $a_tag_rx = qr{         

            $_TAG_SUBS

        (?<TAG> (?&a_tag) )

        (?(DEFINE)
            (?<a_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                a 
                \b       
            )

            (?<permitted_attribute>
                charset     
              | coords      
              | href        
              | href lang   
              | name        
              | rel 
              | rev 
              | shape       
              | rect
              | circle
              | poly        
              | target
            )

            (?<standard_attribute>
                access key  
              | class       
              | dir 
              | ltr 
              | id
              | lang        
              | style       
              | tab index   
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on blur     
              | on click    
              | on dbl click        
              | on focus    
              | on mouse down       
              | on mouse move       
              | on mouse out        
              | on mouse over       
              | on mouse up 
              | on key down 
              | on key press        
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end define
    }xi;

    $img_tag_rx = qr{           
        $_TAG_SUBS
        (?<TAG> (?&image_tag) )
        (?(DEFINE)

            (?<image_tag> (?&tag) )

            (?<start_tag>  
                < (?&WS) 
                img 
                \b       
            )

            (?<required_attribute>
                alt
              | src
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                 align
               | border
               | hspace
               | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

        ###########################

        ) # end DEFINE

    }six;

}

UNITCHECK { load_patterns() } 

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG tag at 304: <img src="files/products_categories50_t.txt" border="0" alt="products" />
        SRC is files/products_categories50_t.txt
A tag at 214: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
A tag at 451: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
STRONG tag at 491: <strong>
        BODY=Buffer
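
The long program leans heavily on `(?(DEFINE)...)` blocks: named subpatterns that match nothing where they are written but can be called with `(?&name)`. For readers who want to see that mechanism in isolation, here is a much smaller sketch. The grammar, the sub `parse_attributes`, and the sample tag are assumptions made up for this illustration and are not part of the program above.

```perl
use 5.010;
use strict;
use warnings;

# A tiny attribute grammar. The DEFINE block only declares subpatterns;
# the first line is the part that actually matches.
my $attr_rx = qr{
    (?<NAME> \b \w+ ) (?&WS) = (?&WS) (?<VALUE> (?&value) )

    (?(DEFINE)
        (?<WS>       \s*                                     )
        (?<value>    (?&quoted) | (?&unquoted)               )
        (?<quoted>   (?<q> ["'] ) (?: (?! \k<q> ) . )* \k<q> )
        (?<unquoted> [^\s>"']+                               )
    )
}sx;

# Return a list of [name, value] pairs found in one tag string.
sub parse_attributes {
    my ($tag) = @_;
    my @pairs;
    while ($tag =~ /$attr_rx/g) {
        my ($name, $value) = ($+{NAME}, $+{VALUE});
        $value =~ s/\A(["'])(.*)\1\z/$2/s;   # drop matching outer quotes
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my $sample = q{<img src="files/products_categories50_t.txt" border=0 alt='products' />};
for my $pair (parse_attributes($sample)) {
    say "$pair->[0] = $pair->[1]";
}
```

The full program applies the same idea at a larger scale: shared subpatterns live in one DEFINE block, and each tag-specific pattern adds its own definitions for the attribute names it accepts.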

The Choice Is Yours — Or Is It?

Both those solve your problem with regexes. It is possible that you will be able to use the first of my two approaches. I cannot say, because like seemingly all such questions asked here, you haven’t told us enough about the data for us (and perhaps also you) to know for sure whether the naïve approach will suffice.

When it doesn’t, you have two choices.

  1. You can either use the more robust and flexible approach offered by my second technique. Just make certain that you understand it in all its aspects, because otherwise you won’t be able to maintain your code — and neither will anybody else.
  2. Use an HTML parsing class.

I find it unlikely that even 1 person in 1000 would reasonably make the first of those two choices. In particular, I find it extremely unlikely that someone who asks for help with regexes as simple as those in my first solution would be a person capable of managing the regexes given in my second solution.

Which really leaves you with only one "choice" — if I may use that word so loosely.
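
For completeness, here is a rough sketch of what that remaining choice might look like. It assumes the CPAN module `HTML::TokeParser` (part of the HTML-Parser distribution, not core Perl) is installed; the sub name `extract_items` and the output labels are made up for this illustration, and other parsing classes would serve equally well.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # assumption: HTML-Parser distribution from CPAN is installed

# Walk the document tag by tag and collect the three kinds of data
# the question asks for.
sub extract_items {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my @items;
    while (my $token = $parser->get_tag('a', 'img', 'strong')) {
        my ($tag, $attr) = @$token;
        if ($tag eq 'a' && defined $attr->{href}) {
            push @items, "A HREF=$attr->{href}";
        }
        elsif ($tag eq 'img' && defined $attr->{src}) {
            push @items, "IMG SRC=$attr->{src}";
        }
        elsif ($tag eq 'strong') {
            # Text up to the matching </strong>, whitespace-trimmed.
            push @items, 'STRONG=' . $parser->get_trimmed_text('/strong');
        }
    }
    return @items;
}

# The fragment from the question, slightly shortened.
my $html = <<'END_HTML';
<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a>
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>
END_HTML

print "$_\n" for extract_items($html);
```

The parser takes care of tokenizing tags and decoding attribute values, so the code only has to say which tags and attributes it cares about.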
