What are some good ways to parse HTML and CSS in Perl?


Question

I have a project where my input files used to be XML. I'm now being asked to process HTML with embedded CSS instead, and I'd like to do this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I think I'll need to move to something else. Before I dig myself knee-deep into silly decisions I'll likely regret, I wanted to ask here: what do you use for this kind of task?

The structures of the old XML and the new HTML input files are pretty similar, and both hold the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and style attributes instead of separate XML attributes.

An example of the old XML:

<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
      h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
      o_size="11.04" o_cs="4.6">
Some text
</text>

An example of the new HTML:

<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
  <span class="ft19" >
    Some text
  </span></nobr>
</div>

where "ft19" refers to a CSS style rule defined at the top of the page, of the form:

.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}

Basically, all I want is a parser that can read the stylistic properties of each node as attributes, so I could do something like:

my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];   # second <text> node on the page
print "node's bold value is: " . $text_node->getAttribute('bold');

as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to start this the right way, instead of picking something on CPAN that sort of does what I want and realizing two months later that another module was a much better fit for what I'm trying to do.

Thoughts?

Recommended answer

HTML::Parser

There is also a project that works with it, Marpa::HTML, which comes out of the larger Marpa parser project. Marpa parses any language that can be described in BNF and is documented on the author's blog. It's very interesting, but much newer and experimental.

I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
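To give a feel for the pull-parser style, here is a minimal sketch with HTML::TokeParser. The sample markup is modeled on the question's HTML, and the `@spans` structure is only an illustration, not something the module provides.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # ships with the HTML-Parser distribution on CPAN

# Sample input modeled on the question's markup (an assumption for illustration).
my $html = '<div style="position:absolute;top:145;left:89;">'
         . '<span class="ft19">Some text</span></div>';

# Passing a scalar reference tells the parser to read from the string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @spans;
# get_tag('span') skips ahead to the next <span> start tag, or returns
# undef at the end of the document.
while (my $tag = $parser->get_tag('span')) {
    my $attrs = $tag->[1];                             # hash ref of attributes
    my $text  = $parser->get_trimmed_text('/span');    # text up to </span>
    push @spans, { class => $attrs->{class}, text => $text };
}

printf "%s: %s\n", $_->{class} // '(none)', $_->{text} for @spans;
```

This gives you a token stream rather than a tree, so there is nothing like findnodes; you walk the document in order and collect what you need.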

If you need something even more generic (and evil), you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, though I'm not sure about tag attributes) or even Regexp::Grammars. But again, this means reinventing the wheel somewhat, so I would only go down these routes if the options above don't do what you need.

Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than the others.

Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
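As a rough sketch of how that could look for the question's input: build the tree with HTML::TreeBuilder (part of HTML::Tree), find nodes with look_down, and merge the class rule with the inline style so each property can be read much like an attribute. The helpers `parse_declarations` and `style_of` are hypothetical names written for this example, and the regex-based CSS handling assumes the flat `.class{prop:value;}` format shown in the question; it is not a general CSS parser.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # part of the HTML::Tree distribution on CPAN

# Sample input modeled on the question's markup (shortened for illustration).
my $html = <<'HTML';
<html><head><style>
.ft19{ vertical-align:top;font-size:14px;font-family:Times;font-style:italic; }
</style></head><body>
<div o="9ka" style="position:absolute;top:145;left:89;">
  <span class="ft19">Some text</span>
</div>
</body></html>
HTML

# Hypothetical helper: turn "name:value;name:value" into a hash ref.
sub parse_declarations {
    my ($text) = @_;
    my %decl;
    for my $pair (split /;/, $text // '') {
        my ($name, $value) = split /:/, $pair, 2;
        next unless defined $value;
        s/^\s+|\s+$//g for $name, $value;   # trim both in place
        $decl{lc $name} = $value;
    }
    return \%decl;
}

# Hypothetical helper: class rules first, then inline style overrides them.
sub style_of {
    my ($elem, $rules) = @_;
    my %style;
    for my $class (split ' ', $elem->attr('class') // '') {
        %style = (%style, %{ $rules->{$class} // {} });
    }
    %style = (%style, %{ parse_declarations($elem->attr('style')) });
    return \%style;
}

my $tree = HTML::TreeBuilder->new_from_content($html);

# as_text skips <style> content, so read the raw text children instead.
my %class_rules;
for my $style_elem ($tree->look_down(_tag => 'style')) {
    my $css = join '', grep { !ref } $style_elem->content_list;
    while ($css =~ /\.([\w-]+)\s*\{([^}]*)\}/g) {
        $class_rules{$1} = parse_declarations($2);
    }
}

my ($span) = $tree->look_down(_tag => 'span', class => 'ft19');
my $style  = style_of($span, \%class_rules);
print "text: ", $span->as_trimmed_text, "\n";
print "node's font-style value is: $style->{'font-style'}\n";

$tree->delete;   # free the tree explicitly, as the HTML::Tree docs suggest
```

If the real stylesheets are more complex than one class selector per rule, a dedicated CSS module from CPAN is probably a safer choice than the regex above.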
