Use CSS selectors to collect HTML elements from a streaming parser (e.g. SAX stream)

Question

How can I parse a CSS (CSS3) selector and use it, jQuery-style, to collect HTML elements not from a DOM (a tree structure) but from a stream (e.g. SAX), i.e. using a sequential-access, event-based parser?

By the way, are there any CSS selectors (or combinations of them) that need access to the DOM? The Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree".

I am most interested in implementing selector combinators, e.g. the 'A B' descendant selector.

I prefer solutions that describe the algorithm, or that are written in Perl (for HTML::Zoom).

Recommended answer

I would do it using regular expressions.

First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To illustrate, here are some simple selectors and their corresponding regexes:


  • A becomes /<A[^>]*>$/
  • A#someid becomes /<A[^>]*id="someid"[^>]*>$/
  • A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
  • A > B becomes /<A[^>]*><B[^>]*>$/
  • A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/
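The conversion listed above could be automated roughly as follows. This is a minimal sketch, not a complete selector compiler: the function names `compound_to_regex` and `selector_to_regex` are made up for illustration, only a tag name with an optional `#id` or `.class` is supported, and only the child (`>`) and descendant (whitespace) combinators are handled. One small deviation from the listed patterns, as an assumption of this sketch: a lookahead after the tag name keeps `a` from also matching `<abbr>`.

```python
import re

# Matches zero or more intermediate opening tags (descendant combinator).
ANY_TAGS = r"(?:<[^>]*>)*"

def compound_to_regex(compound: str) -> str:
    """Translate 'a', 'a#someid' or 'a.someclass' into a regex fragment."""
    m = re.fullmatch(r"([A-Za-z][\w-]*)(?:#([\w-]+)|\.([\w-]+))?", compound)
    if m is None:
        raise ValueError(f"unsupported selector part: {compound!r}")
    tag, ident, cls = m.groups()
    # The lookahead requires the tag name to end here (space or '>').
    pattern = "<" + re.escape(tag) + r"(?= |>)[^>]*"
    if ident:
        pattern += ' id="' + re.escape(ident) + '"[^>]*'
    if cls:
        pattern += (r' class="[^"]*(?<= |")' + re.escape(cls)
                    + r'(?= |")[^"]*"[^>]*')
    return pattern + ">"

def selector_to_regex(selector: str) -> re.Pattern:
    """Compile a selector such as 'a > b' or 'a b' into a stack-state regex."""
    tokens = re.sub(r"\s*>\s*", " > ", selector.strip()).split()
    pattern = ""
    direct = True  # True: no ANY_TAGS gap before the next compound
    for token in tokens:
        if token == ">":
            direct = True
            continue
        if not direct:
            pattern += ANY_TAGS
        pattern += compound_to_regex(token)
        direct = False
    # Anchored at the end only, so the match need not start at the root.
    return re.compile(pattern + "$")
```

For example, `selector_to_regex("a b")` accepts the state string `<a><c><b>`, while `selector_to_regex("a > b")` rejects it.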

And so on. Note that the regular expressions all end with $ but do not start with ^; this corresponds to the way CSS selectors do not have to match from the root of the document. Also note the lookbehind and lookahead in the class-matching pattern, which are necessary so that you don't accidentally match "someclass-super-duper" when you want the quite distinct class "someclass".

If you need more examples, please let me know.

Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of the tags that are currently open; update this stack whenever you descend or ascend. To check for a selector match, convert that stack into a string of opening tags that the regular expression can be matched against. For example, consider this document:

<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>

As you enter each element, your stack state string would go through the following values, in order:


  1. <x>
  2. <x><a>
  3. <x><y id="boo">
  4. <x><y id="boo"><z class="bar">
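The sequence above can be reproduced with a small SAX handler. This is a sketch under assumptions: it uses Python's standard `xml.sax` module rather than the Perl tooling mentioned in the question, the class name `StackStateHandler` is invented for illustration, and the input is assumed to be well-formed XML.

```python
import xml.sax
from html import escape

class StackStateHandler(xml.sax.ContentHandler):
    """Records the stack-state string each time an element is entered."""

    def __init__(self):
        super().__init__()
        self.stack = []   # opening tags of the currently open elements
        self.states = []  # one state string per element entered

    def startElement(self, name, attrs):
        attr_text = "".join(
            ' {}="{}"'.format(key, escape(value)) for key, value in attrs.items()
        )
        self.stack.append("<{}{}>".format(name, attr_text))
        self.states.append("".join(self.stack))

    def endElement(self, name):
        # Ascending: the element just closed no longer applies.
        self.stack.pop()

document = b'<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>'
handler = StackStateHandler()
xml.sax.parseString(document, handler)
for state in handler.states:
    print(state)
```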

The matching process is simple: whenever the parser descends into a new element, update the state string and check whether it matches the selector regex. If the regex matches, then the selector matches that element!
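Put together, the matching step might look like the sketch below. It assumes Python's `html.parser.HTMLParser` as the event source and input in which every element is explicitly closed (unclosed void elements such as `<br>` would need extra handling). The class name `SelectorCollector` is hypothetical, and the regex is written by hand in the style of the list above for the selector `x z`.

```python
import re
from html import escape
from html.parser import HTMLParser

class SelectorCollector(HTMLParser):
    """Collects the opening tag of every element matched by a selector regex."""

    def __init__(self, selector_regex):
        super().__init__()
        self.selector_regex = selector_regex
        self.stack = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "")) for name, value in attrs
        )
        opening = "<{}{}>".format(tag, attr_text)
        self.stack.append(opening)
        # Descended into a new element: test the updated state string.
        if self.selector_regex.search("".join(self.stack)):
            self.matches.append(opening)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

# Hand-written regex for the descendant selector "x z".
x_z = re.compile(r'<x[^>]*>(?:<[^>]*>)*<z[^>]*>$')
collector = SelectorCollector(x_z)
collector.feed('<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>')
collector.close()
print(collector.matches)  # prints ['<z class="bar">']
```

A real collector would probably also buffer the matched element's content until its end tag arrives; this sketch records only the opening tag.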

Issues to watch out for:


  • Double quotes inside attributes. To get around this, apply HTML entity encoding to attribute values both when creating the regex and when creating the stack state string.

  • Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass, which expects <a id="someid" class="someclass">, unfortunately fails when your parser enters <a class="someclass" id="someid">.

  • Case sensitivity. According to the HTML spec, the class and id attributes match case-sensitively (notice the 'CS' marker on the corresponding sections), so you must use case-sensitive regex matching. However, element names are not case-sensitive in HTML, although they are in XML. If you want HTML-like case-insensitive element name matching, canonicalize element names to either upper case or lower case in both the selector regex and the stack state string.

  • Additional magic is necessary to deal with selector patterns that involve the presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to each tag containing the name of the immediately preceding sibling tag, or "" if the tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.

EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:


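  1. There may be more than one possible transition from any given state.

  2. The machine tries one of them, and if that doesn't work out, it backtracks and tries another.

  3. The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack.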

The secret behind nearly all of the awesomeness of regular expressions is their use of this style of state machine.
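A hand-rolled version of such a backtracking matcher might look like this. It is a sketch under simplifying assumptions: steps match on tag names only (no ids or classes), the selector is pre-parsed into `(combinator, tag)` pairs, and the name `selector_matches` is invented for illustration. It walks the selector right to left against the list of open tag names and keeps an explicit stack of untried choices.

```python
def selector_matches(steps, path):
    """Check whether the selector matches the element at the end of path.

    steps: list of (combinator, tag); combinator links the step to the one
           on its left and is ">" (child) or " " (descendant); the first
           step's combinator is ignored.
    path:  tag names of the open elements, root first, current element last.
    """
    if not path or steps[-1][1] != path[-1]:
        return False
    # Each entry: (index of a step already matched, index in path it matched at).
    pending = [(len(steps) - 1, len(path) - 1)]
    while pending:
        step_i, path_i = pending.pop()  # backtrack to the latest untried choice
        if step_i == 0:
            return True  # every step found a place in the path
        combinator = steps[step_i][0]
        wanted = steps[step_i - 1][1]
        if combinator == ">":
            # Child combinator: only the direct parent is a candidate.
            candidates = range(max(path_i - 1, 0), path_i)
        else:
            # Descendant combinator: any ancestor is a candidate.
            candidates = range(path_i)
        for k in candidates:
            if path[k] == wanted:
                pending.append((step_i - 1, k))  # one possible transition
    return False

# "a > b c" parsed by hand into steps.
steps = [("", "a"), (">", "b"), (" ", "c")]
print(selector_matches(steps, ["a", "b", "x", "b", "c"]))  # prints True
print(selector_matches(steps, ["a", "x", "b", "c"]))       # prints False
```

In the first call the matcher initially tries the nearer `b`, finds that its parent is `x` rather than `a`, and backtracks to the outer `b`, which is the pop-and-retry behaviour described in the list above.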
