Trouble with parsing table data in Perl


Question

I have a long HTML document that repeats a pattern like this:

<td class="MODULE_PRODUCTS_CELL " align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a><r>
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>

From the above HTML I want to extract:

  1. collections.php?prod_id=50
  2. files/products_categories50_t.txt
  3. Buffer

I have tried this code to begin with:

#!/usr/local/bin/perl

use strict;
use warnings;
my $filename =  'sr.txt';

open(FILENAME,$filename);
my @str = <FILENAME>;
chomp(@str);
#print "@str";

foreach my  $str(@str){    
     if ($str =~/<td class(.*)<a href(.*?)></td>/) {
         print "*****$2\n";
     }    
}

This code is only a first attempt. However, it finds only the last occurrence rather than every occurrence. Why?

Recommended answer

SUMMARY

Using patterns on small, limited, reasonably well-defined pieces of HTML is quick and easy. But using them on an entire document containing fully general, open-ended HTML with unforeseeable quirks is, while theoretically possible, in practice much too hard compared with using someone else’s parser that has already been written for that express purpose. See also this answer for a more general discussion of using patterns on XML or HTML.

You’ve asked for a regex solution, so I will provide one.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

while (m{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "IMG SRC=$1\n";
}

while (m{ < \s* a [^>]* href \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "A HREF=$1\n";
}

while (m{ < \s* strong [^>]* > (.*?) < \s* / \s* strong \s* > }gsix) {
    print "STRONG=$1\n";
}

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG SRC=files/products_categories50_t.txt
A HREF=/collections.php?prod_id=50
A HREF=/collections.php?prod_id=50
STRONG=Buffer

If you are quite certain that this works for the particular specimen of HTML you care about, then by all means use it. Notice several things that I do which you didn’t. One of them is not processing the HTML a line at a time: your pattern needs text from several lines, so no single line can ever match it. That approach virtually never works.
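
To make the line-at-a-time point concrete, here is a minimal sketch. The sample markup and the pattern are simplified stand-ins, not the asker's real file or regex. The same pattern is applied once per line and once to the whole string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for the question's markup: the <td class ...>,
# the <a href ...> and the closing </td> all sit on different lines.
my $html = <<'END_HTML';
<td class="MODULE_PRODUCTS_CELL">
<a href="/collections.php?prod_id=50">
<strong>Buffer</strong></a>
</td>
END_HTML

# Illustrative pattern; /s lets "." match newlines.
my $pattern = qr{<td\s+class.*?<a\s+href="([^"]+)".*?</td>}s;

# Line at a time: no single line holds both "<td class" and "</td>".
my @by_line = grep { /$pattern/ } split /\n/, $html;

# Whole document at once: the match is free to span several lines.
my @slurped = $html =~ /$pattern/g;

printf "line-at-a-time matches: %d\n", scalar @by_line;   # prints 0
printf "whole-string matches:   %d\n", scalar @slurped;   # prints 1
```

Reading the input in one gulp, as `$/ = undef` does in the program above, is what lets a single match cross line boundaries.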

However, this sort of solution works only on extremely limited forms of valid HTML. You can use it only when you can guarantee that the HTML you’re working with really looks like what you expect.
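
As a rough illustration of how quickly that guarantee can break, the sketch below runs the simple `img` pattern from the first program over two lines of perfectly legal HTML. The file names `old_logo.png` and `real.png` are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The simple img/src pattern from the first program.
my $img_rx = qr{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }six;

# Hypothetical input: one image inside a comment, and one live image
# whose alt text legitimately contains a ">" character.
my $html = <<'END_HTML';
<!-- <img src="old_logo.png" alt="retired"> -->
<img alt="a > b" src="real.png">
END_HTML

my @found;
while ($html =~ /$img_rx/g) {
    push @found, $1;
}

# The commented-out image is reported, while the live one is missed
# because [^>]* stops at the ">" inside the alt attribute.
print "found: @found\n";   # prints "found: old_logo.png"
```

Neither quirk is exotic, which is why the simple patterns are safe only on markup you control or have inspected.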

The problem is that it quite often does not look all neat and tidy. For these situations, you are strongly advised to use an HTML parsing class. However, no one seems to have shown you the code to do that. That’s not very helpful.

And I’m going to be one of them myself. I am going to show you a more general solution to what I believe your task to be, but unlike anyone else who ever posts on Stack Overflow, I’m going to use regexes to do it, just to show you that it can be done, and that you do not wish to do it this way:

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

our(
    $RX_SUBS,
    $tag_template_rx,
    $script_tag_rx,
    $style_tag_rx,
    $strong_tag_rx,
    $a_tag_rx,
    $img_tag_rx,
);

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ $style_tag_rx  .*?  < (?&WS) / (?&WS) style  (?&WS) > }{}gsix; 
s{ $script_tag_rx .*?  < (?&WS) / (?&WS) script (?&WS) > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

while (/$img_tag_rx/g) {
    my $tag = $+{TAG};
    printf "IMG tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
             src (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tSRC is $value\n";
    } 

} 

while (/$a_tag_rx/g) {
    my $tag = $+{TAG};
    printf "A tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
             href (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tHREF is $value\n";
    } 
} 

while (m{
            $strong_tag_rx  (?&WS) 
            (?<BODY> .*? )  (?&WS) 
            < (?&WS) / (?&WS) strong (?&WS) > 
        }gsix) 
{
    my ($tag, $body) = @+{ qw< TAG BODY > };
    printf "STRONG tag at %d: %s\n\tBODY=%s\n", 
            pos(), $+{TAG}, $+{BODY};
} 

exit;

sub dequote { 
    my $string = shift();
    $string =~ s{
        ^
        (?<quote>   ["']      )
        (?<BODY> 
            (?: (?! \k<quote> ) . ) *
        )
        \k<quote> 
        $
    }{$+{BODY}}gsx;
    return $string;
}

sub load_patterns { 

    $RX_SUBS = qr{ (?(DEFINE)

        (?<any_attribute> 
             \w+
            (?&WS) = (?&WS) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<WS>     \s *   )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

      ) # end DEFINE

    }six;

    my $_TAG_SUBS = $RX_SUBS . q{ (?(DEFINE)

        (?<attributes>
            (?: 
                (?&WS) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            (?= (?&legal_attribute) )
            (?&any_attribute) 
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<illegal_attribute>  \w+  )

        (?<tag>
            (?&start_tag)
            (?&WS) 
            (?&attributes) 
            (?&WS) 
            (?&end_tag)
        )

      ) # end DEFINE

    };  # this is a q tag, not a qr

    $tag_template_rx = qr{ 

            $_TAG_SUBS

        (?<TAG> (?&XXX_tag) )

        (?(DEFINE)
            (?<XXX_tag>     (?&tag)             )
            (?<start_tag>  < (?&WS) XXX       )
            (?<required_attribute>      (*FAIL) )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE
    }six;

    $script_tag_rx = qr{   

            $_TAG_SUBS

        (?<TAG> (?&script_tag) )
        (?(DEFINE)
            (?<script_tag>  (?&tag)                )
            (?<start_tag>  < (?&WS) script       )
            (?<required_attribute>      type )
            (?<permitted_attribute>             
                charset     
              | defer
              | src
              | xml:space
            )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end DEFINE
    }six;

    $style_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&style_tag) )

        (?(DEFINE)

            (?<style_tag>  (?&tag)  )

            (?<start_tag>  < (?&WS) style        )

            (?<required_attribute>      type    )
            (?<permitted_attribute>     media   )

            (?<standard_attribute>
                dir
              | lang
              | title
              | xml:lang
            )

            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        )  # end define

    }six;

    $strong_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&strong_tag) )

        (?(DEFINE)

            (?<strong_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                strong 
                       
            )

            (?<standard_attribute>
                class       
              | dir 
              | ltr 
              | id  
              | lang        
              | style       
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on click    
              | on dbl click        
              | on mouse down       
              | on mouse move       
              | on mouse out        
              | on mouse over       
              | on mouse up 
              | on key down 
              | on key press        
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<optional_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE

    }six; 

    $a_tag_rx = qr{         

            $_TAG_SUBS

        (?<TAG> (?&a_tag) )

        (?(DEFINE)
            (?<a_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                a 
                       
            )

            (?<permitted_attribute>
                charset     
              | coords      
              | href        
              | href lang   
              | name        
              | rel 
              | rev 
              | shape       
              | rect
              | circle
              | poly        
              | target
            )

            (?<standard_attribute>
                access key  
              | class       
              | dir 
              | ltr 
              | id
              | lang        
              | style       
              | tab index   
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on blur     
              | on click    
              | on dbl click        
              | on focus    
              | on mouse down       
              | on mouse move       
              | on mouse out        
              | on mouse over       
              | on mouse up 
              | on key down 
              | on key press        
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end define
    }xi;

    $img_tag_rx = qr{           
        $_TAG_SUBS
        (?<TAG> (?&image_tag) )
        (?(DEFINE)

            (?<image_tag> (?&tag) )

            (?<start_tag>  
                < (?&WS) 
                img 
                       
            )

            (?<required_attribute>
                alt
              | src
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                 align
               | border
               | hspace
               | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

        ###########################

        ) # end DEFINE

    }six;

}

UNITCHECK { load_patterns() } 

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

That program, when run, produces this output:

IMG tag at 304: <img src="files/products_categories50_t.txt" border="0" alt="products" />
        SRC is files/products_categories50_t.txt
A tag at 214: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
A tag at 451: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
STRONG tag at 491: <strong>
        BODY=Buffer
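
The long program leans on two Perl 5.10 features: named captures, read back through `%+`, and `(?(DEFINE) ...)` blocks, whose named subpatterns are called with `(?&name)`. The following is only a cut-down sketch of that mechanism, not the program above. The subpattern names `quoted` and `bare` and the sample tag are made up for the illustration.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# One attribute: NAME = VALUE, where VALUE is quoted or bare.
# The DEFINE block matches nothing itself; it only declares the
# subpatterns that the first line calls via (?&...).
my $attr_rx = qr{
    (?<NAME> \w+ ) (?&WS) = (?&WS) (?<VALUE> (?&quoted) | (?&bare) )

    (?(DEFINE)
        (?<WS>     \s*                                            )
        (?<quoted> (?<q> ["'] ) (?: (?! \k<q> ) . )* \k<q>        )
        (?<bare>   [^\s>]+                                        )
    )
}sx;

# Hypothetical tag mixing double-quoted, bare and single-quoted values.
my $tag = q{<img src="files/a.png" border=0 alt='products'>};

my @pairs;
while ($tag =~ /$attr_rx/g) {
    push @pairs, "$+{NAME}=$+{VALUE}";
}
say for @pairs;
```

Each captured value still carries its quotes, which is why the full program passes values through its `dequote` helper.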

The Choice Is Yours, Or Is It?

Both of those solve your problem with regexes. It is possible that you will be able to use the first of my two approaches. I cannot say, because, as with seemingly all such questions asked here, you haven’t told us enough about the data for us (and perhaps also you) to know for sure whether the naïve approach will suffice.

When it doesn’t, you have two choices.

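  1. You can use the more robust and flexible approach offered by my second technique. Just be sure you understand every aspect of it, because otherwise you won’t be able to maintain your code, and neither will anyone else.
  2. Use an HTML parsing class.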

I find it unlikely that even 1 person in 1,000 would reasonably make the first of those two choices. In particular, I find it extremely unlikely that someone who asks for help with regexes as simple as those in my first solution would be capable of managing the regexes given in my second solution.

Which really leaves you with only one "choice", if I may use that word so loosely.
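
For completeness, here is a rough sketch of what that remaining choice might look like. It assumes the `HTML::TokeParser` module from the HTML-Parser distribution on CPAN is installed, since it is not part of core Perl. The helper name `extract_links_images_strong` is invented for this example, and other parsing classes would serve equally well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # assumption: installed from CPAN (HTML-Parser)

my $html = <<'END_HTML';
<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr><td align="center" height="75">
        <a href="/collections.php?prod_id=50">
            <img src="files/products_categories50_t.txt" border="0" alt="products" />
        </a><br/>
    </td></tr>
    <tr><td align="center">
        <a href="/collections.php?prod_id=50"><strong>Buffer</strong><br /></a>
    </td></tr>
</table>
</td>
END_HTML

# Hypothetical helper: returns array refs of href values, src values,
# and the text found inside <strong> elements.
sub extract_links_images_strong {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "could not create parser";
    my (@hrefs, @srcs, @strongs);

    # get_tag skips ahead to the next start tag with one of these names
    # and returns [tag name, attribute hash, ...].
    while (my $token = $parser->get_tag('a', 'img', 'strong')) {
        my ($tag, $attr) = @$token;
        if ($tag eq 'a') {
            push @hrefs, $attr->{href} if defined $attr->{href};
        }
        elsif ($tag eq 'img') {
            push @srcs, $attr->{src} if defined $attr->{src};
        }
        else {
            # Text up to the closing </strong>, whitespace trimmed.
            push @strongs, $parser->get_trimmed_text('/strong');
        }
    }
    return (\@hrefs, \@srcs, \@strongs);
}

my ($hrefs, $srcs, $strongs) = extract_links_images_strong($html);
print "A HREF=$_\n"  for @$hrefs;
print "IMG SRC=$_\n" for @$srcs;
print "STRONG=$_\n"  for @$strongs;
```

The parser copes with comments, quoting styles and odd whitespace, so the code only has to say which tags and attributes it wants.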
