Navigating a webpage using html5-parser and xmls Common Lisp
I am trying to get the first row under the column with the title "Name". For example, for https://en.wikipedia.org/wiki/List_of_the_heaviest_people I want to return the name "Jon Brower Minnoch". My code so far is as follows, but I think there must be a more general way of getting the name:
(defun find-tag (tag doc)
  (when (listp doc)
    (when (string= (xmls:node-name doc) tag)
      (return-from find-tag doc))
    (loop for child in (xmls:node-children doc)
          for find = (find-tag tag child)
          when find do (return-from find-tag find)))
  nil)
(defun parse-list-website (url)
  (second
   (second
    (second
     (third
      (find-tag "td"
                (html5-parser:parse-html5 (drakma:http-request url)
                                          :dom :xmls)))))))
I then call the function like this:
(parse-list-website "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
I am not very good with xmls and don't know how to get a td under a certain column header.
The elements in the document returned by html5-parser:parse-html5 are in the form:
("name" (attribute-alist) &rest children)
You could access the parts with the standard list manipulation functions, but xmls also provides the functions node-name, node-attrs and node-children to access the three parts. It's a little clearer to use those. There are also the functions xmlrep-attrib-value, to get the value of an attribute, and xmlrep-tagmatch, to match the tag name. The children are either plain strings or elements in the same format.
So for example, an HTML document with a 2x2 table would look like this:
(defparameter *doc*
  '("html" ()
    ("head" ()
     ("title" ()
      "Some title"))
    ("body" ()
     ("table" (("class" "some-class"))
      ("tr" (("class" "odd"))
       ("td" () "Some string")
       ("td" () "Another string"))
      ("tr" (("class" "even"))
       ("td" () "Third string")
       ("td" () "Fourth string"))))))
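To make the three-part layout concrete, here is a minimal sketch that picks an element apart with ordinary list functions. The names el-name, el-attrs and el-children are hypothetical stand-ins for xmls:node-name, xmls:node-attrs and xmls:node-children, used only so that the snippet runs without loading any library.

```lisp
;; Hypothetical stand-ins for xmls:node-name, xmls:node-attrs and
;; xmls:node-children, written with plain list functions.
(defun el-name (el) (first el))
(defun el-attrs (el) (second el))
(defun el-children (el) (cddr el))

;; One row in the ("name" (attribute-alist) &rest children) format.
(defparameter *row*
  '("tr" (("class" "odd"))
    ("td" () "Some string")
    ("td" () "Another string")))

(el-name *row*)               ; => "tr"
(el-attrs *row*)              ; => (("class" "odd"))
(length (el-children *row*))  ; => 2

;; Attributes are (name value) lists, so ASSOC with a string test finds them.
(second (assoc "class" (el-attrs *row*) :test #'string=))  ; => "odd"
```

In real code the xmls accessors are the better choice, since they document the intent.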
In order to traverse the DOM tree, let's define a recursive depth-first search like this (note that if-let comes from the alexandria library; either import it, or change it to alexandria:if-let):
(defun find-tag (predicate doc &optional path)
  (when (funcall predicate doc path)
    (return-from find-tag doc))
  (when (listp doc)
    (let ((path (cons doc path)))
      (dolist (child (xmls:node-children doc))
        (if-let ((find (find-tag predicate child path)))
          (return-from find-tag find))))))
It's called with a predicate function and a document. The predicate function gets called with two arguments: the element being matched and a list of its ancestors, nearest first. In order to find the first <td>, you could do this:
(find-tag (lambda (el path)
            (declare (ignore path))
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)))
          *doc*)
; => ("td" NIL "Some string")
Or to find the first <td> in the even row:
(find-tag (lambda (el path)
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)
                 (string= (xmls:xmlrep-attrib-value "class" (first path))
                          "even")))
          *doc*)
; => ("td" NIL "Third string")
Getting the second <td> on the even row would require something like this:
(let ((matches 0))
  (find-tag (lambda (el path)
              (when (and (listp el)
                         (xmls:xmlrep-tagmatch "td" el)
                         (string= (xmls:xmlrep-attrib-value "class" (first path))
                                  "even"))
                (incf matches))
              (= matches 2))
            *doc*))
You could define a helper function to find the nth tag:
(defun find-nth-tag (n tag doc)
  (let ((matches 0))
    (find-tag (lambda (el path)
                (declare (ignore path))
                (when (and (listp el)
                           (xmls:xmlrep-tagmatch tag el))
                  (incf matches))
                (= matches n))
              doc)))

(find-nth-tag 2 "td" *doc*) ; => ("td" NIL "Another string")
(find-nth-tag 4 "td" *doc*) ; => ("td" NIL "Fourth string")
You might want to have a simple helper to get the text of a node:
(defun node-text (el)
  (if (listp el)
      (first (xmls:node-children el))
      el))
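Note that node-text returns only the first child, which may itself be an element (for example a link inside a cell) rather than a string. A variant that gathers all the text under a node could be sketched as follows. The name node-full-text is hypothetical, the sample href is made up, and the sketch assumes every non-string child is an element list in the format shown earlier. It uses cddr in place of xmls:node-children so that it runs on its own.

```lisp
(defun node-full-text (el)
  "Concatenate every string found under EL, depth-first.
Assumes EL is a string or a (\"name\" (attrs) &rest children) list."
  (with-output-to-string (out)
    (labels ((walk (node)
               (cond ((stringp node) (write-string node out))
                     ;; CDDR skips the name and the attribute list.
                     ((consp node) (mapc #'walk (cddr node))))))
      (walk el))))

(node-full-text
 '("td" () ("a" (("href" "/wiki/X")) "Jon Brower") " Minnoch"))
;; => "Jon Brower Minnoch"
```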
You could define similar helpers to do whatever you need to do in your application. Using these, the example you gave would look like this:
(defparameter *doc*
  (html5-parser:parse-html5
   (drakma:http-request "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
   :dom :xmls))

(node-text (find-nth-tag 1 "a" (find-nth-tag 1 "td" *doc*)))
; => "Jon Brower Minnoch"
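The call above relies on the name being in the very first <td> of the page. To look a cell up by its column header instead, one possible approach is to find the position of the matching <th> in the header row and take the <td> at the same position in the first data row. The following is only a sketch under several assumptions: the header row contains only <th> cells, data rows contain only <td> cells, and there is no colspan, rowspan or nested table. All function names and the sample table are hypothetical, and plain list functions stand in for the xmls accessors so that the snippet runs on its own.

```lisp
(defun el-children (el) (cddr el))

(defun tagp (tag el)
  "True when EL is an element list whose name is TAG."
  (and (consp el) (stringp (first el)) (string-equal tag (first el))))

(defun descendants (tag el)
  "All descendants of EL named TAG, in document order."
  (loop for child in (el-children el)
        when (tagp tag child) collect child
        when (consp child) append (descendants tag child)))

(defun children-named (tag el)
  "Direct children of EL named TAG."
  (remove-if-not (lambda (child) (tagp tag child)) (el-children el)))

(defun full-text (el)
  "Concatenate every string found under EL, depth-first."
  (with-output-to-string (out)
    (labels ((walk (node)
               (cond ((stringp node) (write-string node out))
                     ((consp node) (mapc #'walk (el-children node))))))
      (walk el))))

(defun first-cell-under (header table)
  "Text of the first data cell in the column whose <th> text is HEADER,
or NIL when there is no such column or no data row."
  (let* ((rows (descendants "tr" table))
         (header-row (find-if (lambda (row) (children-named "th" row)) rows))
         (index (position header (children-named "th" header-row)
                          :key (lambda (th)
                                 (string-trim '(#\Space #\Tab #\Newline)
                                              (full-text th)))
                          :test #'string=))
         (data-row (find-if (lambda (row) (children-named "td" row)) rows)))
    (when (and index data-row)
      (full-text (nth index (children-named "td" data-row))))))

;; Made-up sample table, including the <tbody> an HTML5 parser inserts.
(defparameter *table*
  '("table" ()
    ("tbody" ()
     ("tr" () ("th" () "Name") ("th" () "Country"))
     ("tr" () ("td" () ("a" () "First person")) ("td" () "Somewhere"))
     ("tr" () ("td" () "Second person") ("td" () "Elsewhere")))))

(first-cell-under "Name" *table*)     ; => "First person"
(first-cell-under "Country" *table*)  ; => "Somewhere"
```

On a real page you would first locate the right <table> element, for example with find-nth-tag, and pass that in. Whether the assumptions hold for a given page has to be checked against its actual markup.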