Strip HTML tags with Perl


Question


What's the easiest way to strip HTML tags in Perl? I am using a regular expression to parse HTML from a URL, which works great, but how can I strip the HTML tags off?

Here is how I am pulling my HTML:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $now_string = localtime;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";
# Drop <script> blocks first so their contents are not left behind as text.
$html =~ s/<script.*?<\/script>//sg;
# Naive tag removal: deletes everything from a '<' to the next '>'.
$html =~ s/<.+?>//sg;
$html =~ m{(Hail Reports.*)Wind Reports}s || die;
my @hail = $1;

Answer

An attempt to answer your misguided question


Problems


Regexing the tags out of HTML is a bad habit to get into. HTML has so many rules, and so many ways around them, that a regex-based stripper may eventually open your code up to attack. While you might have a legitimate need for something simple now, it is very easy to reuse code and forget why reusing it was a bad idea, especially when you don't add comments like # This code is NOT secure and should not be used to parse HTML anywhere else!!! or # Christina Aguilera writes songs based on this code!!!

Example of differences in HTML that require lots of regex rules:

<div>...</div>
<div style="blah">
<div style="background:url(../div)">
<div style=".." class='noticesinglequote'>

The list goes on and that's only for well-formed HTML. Some other examples of problems include:

  1. HTML elements closed improperly (e.g. <div><span></div></span>) or not at all
  2. Spelling errors (e.g. <dvi>..</div>)
  3. HTML designed with the intention to break your script
  4. Other issues: comments, whitespace, charsets, etc.
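
To make the problem concrete, here is a minimal sketch of one way the naive `s/<.+?>//sg` substitution from the question can go wrong. The input string is a made-up example (not taken from the NWS page): it is valid HTML, but an attribute value contains a `>`, so the regex ends the "tag" too early and leaves attribute debris in the output.

```perl
use strict;
use warnings;

# Hypothetical input: an attribute value that legally contains '>'.
my $html = '<p title="1 > 0">Hail <b>Reports</b></p>';

# The same tag-stripping substitution used in the question,
# applied to a copy so the original string is kept for comparison.
(my $text = $html) =~ s/<.+?>//sg;

# The first match stops at the '>' inside the title attribute,
# so the rest of the attribute survives as "text".
print "$text\n";    # prints: ' 0">Hail Reports' (without the quotes)
```

Patching the regex to cope with quoted attributes is possible, but each such patch covers one case from the list above and leaves the others.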

Solution


You may have accepted an answer, but you should look at XML::Parser and HTML::TreeBuilder.

Rather than stripping out parts of the HTML document, you are probably more interested in drilling down to the part of the document you want (e.g. everything in <body> or a certain div inside of it), which is why you most likely want what one of the above modules provides. Not to mention, a parser will do its best to remove all HTML elements and return only the text/CDATA.
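
As a rough illustration of the parser approach, the sketch below uses HTML::TreeBuilder (a CPAN module, not part of core Perl) to get the text of a whole document and of one element. The markup and the `hail` / `wind` ids are invented for the example; the real NWS page is laid out differently, so the `look_down` criteria would need to be adapted to it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup standing in for the fetched page.
my $html = '<html><body><script>var x = "<b>";</script>'
         . '<div id="hail"><p title="1 > 0">Hail <b>Reports</b></p></div>'
         . '<div id="wind">Wind Reports</div></body></html>';

# Parse the string into a tree of HTML::Element nodes.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Text of the whole document; as_text leaves out <script>/<style> content.
my $all_text = $tree->as_text;

# Drill down to one element (here: the div with the assumed id "hail")
# and take only its text.
my $hail      = $tree->look_down(_tag => 'div', id => 'hail');
my $hail_text = $hail ? $hail->as_text : '';

print "$hail_text\n";

# Free the tree explicitly; older HTML::Element versions need this
# to break circular references.
$tree->delete;
```

Because the parser understands quoting, the `>` inside the `title` attribute is treated as part of the attribute rather than as the end of the tag.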
