How to parse inside HTML tags with REBOL?
Question
I have a web page that I've loaded with load/markup. I need to parse a bunch of stuff out of it, but some of the data is in the tags. Any ideas of how I can parse it? Here's a sample of what I've got (and tried) so far:
REBOL []
mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bsockaway@example.com>bsockaway@example.com</a></td>
<td>9999</td>
}
rules: [
    some [
        ; The expression below /will/ work, but is useless because of specificity.
        ; <td> <a href=mailto:bsockaway@example.com> s: string! </a> (print s/1) </td> |
        ; The expression below will not work, because <a> doesn't match <a href=mailto:...>
        ; <td> <a> s: string! </a> (print s/1) </td> |
        <td> s: string! (print s/1) </td> |
        tag! | string! ; Catch any leftovers.
    ]
]
parse load/markup mess rules
This produces:
Bob Sockaway
9999
I would like to see something more like:
Bob Sockaway
bsockaway@example.com
9999
Any ideas? Thanks!
Note! For what it's worth, I came up with a good simple ruleset that will get the desired results:
rules: [
    some [
        <td> any [tag!] s: string! (print s/1) any [tag!] </td> |
        tag! | string! ; Catch any leftovers.
    ]
]
Recommended answer
When mess is processed with LOAD/MARKUP, you get the following (I've formatted it and annotated it with the types):
[
    ; string!
    "^/"

    ; tag! string! tag!
    <td> "Bob Sockaway" </td>

    ; string!
    "^/"

    ; tag! tag!
    <td> <a href=mailto:bsockaway@example.com>
    ; string!
    "bsockaway@example.com"
    ; tag! tag!
    </a> </td>
    ; (Note: you didn't put the anchor's href in quotes above...)

    ; string!
    "^/"

    ; tag! string! tag!
    <td> "9999" </td>

    ; string!
    "^/"
]
Your output pattern matches series of the form [<td> string! </td>] but not series of the form [<td> tag! string! tag! </td>]. Sidestepping the question posed in your title, you could solve this particular dilemma in several ways. One is to keep a counter that tracks whether you are inside a TD tag, and to print any string encountered while the counter is non-zero:
rules: [
    (td-count: 0)
    some [
        ; if we see an open TD tag, increment a counter
        <td> (++ td-count)
        |
        ; if we see a close TD tag, decrement a counter
        </td> (-- td-count)
        |
        ; capture parse position in s if we find a string
        ; and if counter is > 0 then print the first element at
        ; the parse position (e.g. the string we just found)
        s: string! (if td-count > 0 [print s/1])
        |
        ; if we find any non-TD tags, match them so the
        ; parser will continue along but don't run any code
        tag!
    ]
]
This produces the output you asked for:
Bob Sockaway
bsockaway@example.com
9999
But you also wanted to know, essentially, whether you can transition from block parsing into string parsing within the same set of rules (without jumping into open code). I looked into it, and "mixed parsing" looks like it may be a feature addressed in Rebol 3. Still, I couldn't get it to work in practice, so I asked a question of my own.
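If dropping into open code inside a paren is acceptable, the title question can be approached by capturing each tag and string-parsing it separately. The following is only a minimal sketch, assuming Rebol 2, where tag! is a string series type and to-string on a tag yields its contents without the angle brackets. The words t and link are names introduced here for illustration and are not from the original post:

```rebol
REBOL []

mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bsockaway@example.com>bsockaway@example.com</a></td>
<td>9999</td>
}

rules: [
    some [
        ; capture each tag in t (hypothetical name), then
        ; string-parse its contents in open code
        set t tag! (
            ; assumption: to-string drops the angle brackets,
            ; leaving e.g. "a href=mailto:..."
            if parse/all to-string t ["a " thru "href=" copy link to end] [
                print ["link:" link]
            ]
        )
        |
        ; skip strings between tags
        string!
    ]
]

parse load/markup mess rules
```

Only the anchor tag satisfies the inner string rule, so the other tags are matched and skipped silently. Note that the href here is unquoted; with a quoted attribute the copied value would include the quote characters, and the inner rule would need to account for them.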