Navigating a webpage using html5-parser and xmls in Common Lisp


Problem description

I am trying to get the first row under the column with the title "Name". For example, for https://en.wikipedia.org/wiki/List_of_the_heaviest_people I want to return the name "Jon Brower Minnoch". My code so far is as follows, but I think there must be a more general way of getting the name:

(defun find-tag (tag doc)
  (when (listp doc)
    (when (string= (xmls:node-name doc) tag)
      (return-from find-tag doc))
    (loop for child in (xmls:node-children doc)
          for find = (find-tag tag child)
          when find do (return-from find-tag find)))
  nil)

(defun parse-list-website (url)
  (second (second (second (third (find-tag "td" (html5-parser:parse-html5 (drakma:http-request url) :dom :xmls)))))))

and then I call the function:

(parse-list-website "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")

I am not very good with xmls and don't know how to get a td under a certain column header.

Solution

The elements in the document returned by html5-parser:parse-html5 are in the form:

("name" (attribute-alist) &rest children)

You could access the parts with the standard list manipulation functions, but xmls also provides the functions node-name, node-attrs and node-children to access the three parts, and using those is a little clearer. There are also the functions xmlrep-attrib-value, which gets the value of an attribute, and xmlrep-tagmatch, which matches the tag name. The children are either plain strings or elements in the same format.
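As a minimal sketch of that layout, the three parts can also be pulled apart with plain list functions. The names el-name, el-attrs, el-children and el-attr are hypothetical helpers introduced here for illustration and are not part of xmls; the sketch assumes each attribute is a ("name" "value") list.

```lisp
;; Hypothetical helpers (not part of xmls) for elements shaped like
;; ("name" (attribute-alist) &rest children).
(defun el-name (el) (first el))
(defun el-attrs (el) (second el))
(defun el-children (el) (cddr el))

(defun el-attr (name el)
  "Return the value of attribute NAME on EL, or NIL when it is absent.
Assumes each attribute is a (\"name\" \"value\") list."
  (second (assoc name (el-attrs el) :test #'string=)))

;; Sample element, made up for this sketch.
(defparameter *sample-row*
  '("tr" (("class" "odd"))
    ("td" () "Some string")
    ("td" () "Another string")))

(el-name *sample-row*)               ; => "tr"
(el-attr "class" *sample-row*)       ; => "odd"
(el-attr "id" *sample-row*)          ; => NIL
(length (el-children *sample-row*))  ; => 2
```

Unlike a lookup that signals an error for a missing attribute, el-attr simply returns NIL, which can be convenient when scanning real pages where many elements carry no class.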

So for example, an HTML document with a 2x2 table would look like this:

(defparameter *doc*
  '("html" ()
     ("head" ()
       ("title" ()
         "Some title"))
     ("body" ()
       ("table" (("class" "some-class"))
         ("tr" (("class" "odd"))
           ("td" () "Some string")
           ("td" () "Another string"))
         ("tr" (("class" "even"))
           ("td" () "Third string")
           ("td" () "Fourth string"))))))

In order to traverse the DOM tree, let's define a recursive depth-first search like this. Note that if-let comes from the alexandria library: either import it or write alexandria:if-let.

(defun find-tag (predicate doc &optional path)
  (when (funcall predicate doc path)
    (return-from find-tag doc))

  (when (listp doc)
    (let ((path (cons doc path)))
      (dolist (child (xmls:node-children doc))
        (if-let ((find (find-tag predicate child path)))
          (return-from find-tag find))))))

It's called with a predicate function and a document. The predicate function gets called with two arguments: the element being matched and a list of its ancestors, nearest first. In order to find the first <td>, you could do this:

(find-tag (lambda (el path)
            (declare (ignore path))
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)))
          *doc*)
; => ("td" NIL "Some string")

Or to find the first <td> in the even row:

(find-tag (lambda (el path)
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)
                 (string= (xmls:xmlrep-attrib-value "class" (first path))
                          "even")))
          *doc*)
; => ("td" NIL "Third string")
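The reason (first path) is the enclosing row is that each level of the search conses the current element onto the front of the ancestor list. The following sketch makes that visible by recording the tag names in the path when the first td is reached. It uses only standard list functions, so it runs without xmls or alexandria; find-node, path-names-of-first-td and *small-doc* are hypothetical names made up for this illustration.

```lisp
;; Sample document, made up for this sketch.
(defparameter *small-doc*
  '("html" ()
    ("body" ()
     ("table" ()
      ("tr" (("class" "even"))
       ("td" () "Third string"))))))

(defun find-node (predicate doc &optional path)
  "Depth-first search: return the first node for which PREDICATE is true.
PREDICATE receives the node and the list of its ancestors, nearest first."
  (if (funcall predicate doc path)
      doc
      (when (consp doc)
        (let ((path (cons doc path)))
          ;; SOME returns the first non-NIL result among the children.
          (some (lambda (child) (find-node predicate child path))
                (cddr doc))))))

(defun path-names-of-first-td (doc)
  "Return the tag names of the ancestors of the first <td> in DOC."
  (let ((seen-path nil))
    (find-node (lambda (el path)
                 (when (and (consp el) (string= (first el) "td"))
                   (setf seen-path (mapcar #'first path))
                   t))
               doc)
    seen-path))

(path-names-of-first-td *small-doc*)
;; => ("tr" "table" "body" "html")
```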

Getting the second <td> on the even row would require something like this:

(let ((matches 0))
  (find-tag (lambda (el path)
              (when (and (listp el)
                         (xmls:xmlrep-tagmatch "td" el)
                         (string= (xmls:xmlrep-attrib-value "class" (first path))
                                  "even"))
                (incf matches))
              (= matches 2))
            *doc*))
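The counting predicate above keeps state in a closure, which works but can be easy to get wrong. One possible alternative, sketched here under the assumption that the document is small enough to scan fully, is to collect every match in document order and then pick the one you want with an ordinary list function. The names collect-matching, even-row-td-p and *table-doc* are hypothetical, and the code uses plain list operations so that it runs without xmls.

```lisp
;; Sample table, made up for this sketch.
(defparameter *table-doc*
  '("table" (("class" "some-class"))
    ("tr" (("class" "odd"))
     ("td" () "Some string")
     ("td" () "Another string"))
    ("tr" (("class" "even"))
     ("td" () "Third string")
     ("td" () "Fourth string"))))

(defun collect-matching (predicate doc)
  "Return every node in DOC (including DOC itself) for which PREDICATE is
true, in depth-first document order. PREDICATE receives the node and the
list of its ancestors, nearest first."
  (let ((results '()))
    (labels ((walk (node path)
               (when (funcall predicate node path)
                 (push node results))
               (when (consp node)
                 (let ((path (cons node path)))
                   (dolist (child (cddr node))
                     (walk child path))))))
      (walk doc nil))
    (nreverse results)))

(defun even-row-td-p (el path)
  "True when EL is a <td> whose parent element has class \"even\"."
  (and (consp el)
       (string= (first el) "td")
       (consp (first path))
       (equal (second (assoc "class" (second (first path)) :test #'string=))
              "even")))

(collect-matching #'even-row-td-p *table-doc*)
;; => (("td" NIL "Third string") ("td" NIL "Fourth string"))
(second (collect-matching #'even-row-td-p *table-doc*))
;; => ("td" NIL "Fourth string")
```

The trade-off is that this always walks the whole tree, whereas the counting version stops at the nth match.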

You could define a helper function to find the nth tag:

(defun find-nth-tag (n tag doc)
  (let ((matches 0))
    (find-tag (lambda (el path)
                (declare (ignore path))
                (when (and (listp el)
                           (xmls:xmlrep-tagmatch tag el))
                  (incf matches))
                (= matches n))
              doc)))
(find-nth-tag 2 "td" *doc*) ; => ("td" NIL "Another string")
(find-nth-tag 4 "td" *doc*) ; => ("td" NIL "Fourth string")

You might want to have a simple helper to get the text of a node:

(defun node-text (el)
  (if (listp el)
      (first (xmls:node-children el))
      el))
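Note that node-text only looks at the first child, which may itself be an element rather than a string when a cell contains nested markup. A minimal sketch of a variant that concatenates every string below a node follows; node-full-text is a hypothetical name and the sample cell, including its href value, is made up for illustration.

```lisp
(defun node-full-text (el)
  "Concatenate every string found under EL, in document order.
EL is either a string or a list of the form (name attrs . children)."
  (if (stringp el)
      el
      (with-output-to-string (out)
        (dolist (child (cddr el))
          (write-string (node-full-text child) out)))))

(node-full-text
 '("td" ()
   ("a" (("href" "/wiki/X")) "Jon Brower")
   " "
   ("b" () "Minnoch")))
;; => "Jon Brower Minnoch"
```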

You could define similar helpers to do whatever you need to do in your application. Using these, the example you gave would look like this:

(defparameter *doc*
  (html5-parser:parse-html5
   (drakma:http-request "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
   :dom :xmls))

(node-text (find-nth-tag 1 "a" (find-nth-tag 1 "td" *doc*)))
; => "Jon Brower Minnoch"
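The question asked for the cell under a particular column header rather than simply the first td. One way this might be approached is to find the position of the matching th in the header row and then take the td at the same position in the first data row. The sketch below makes several assumptions: header cells are th elements, no cell uses colspan or rowspan, there are no nested tables, and the document is in the list format shown earlier. All function names and the sample table are hypothetical.

```lisp
;; Sample table, made up for this sketch.
(defparameter *people-table*
  '("table" ()
    ("tbody" ()
     ("tr" () ("th" () "Rank") ("th" () " Name "))
     ("tr" () ("td" () "1") ("td" () ("a" () "First Person")))
     ("tr" () ("td" () "2") ("td" () ("a" () "Second Person"))))))

(defun tagp (tag el)
  "True when EL is an element named TAG."
  (and (consp el) (stringp (first el)) (string= (first el) tag)))

(defun raw-text (el)
  "Concatenate every string under EL, in document order."
  (if (stringp el)
      el
      (with-output-to-string (out)
        (dolist (child (cddr el))
          (write-string (raw-text child) out)))))

(defun cell-text (el)
  "Text under EL with surrounding whitespace removed."
  (string-trim '(#\Space #\Tab #\Newline #\Return) (raw-text el)))

(defun collect-tags (tag doc)
  "Every element named TAG in DOC, in document order."
  (let ((results '()))
    (labels ((walk (node)
               (when (tagp tag node) (push node results))
               (when (consp node) (mapc #'walk (cddr node)))))
      (walk doc))
    (nreverse results)))

(defun cells (tag row)
  "Direct children of ROW that are elements named TAG."
  (remove-if-not (lambda (child) (tagp tag child)) (cddr row)))

(defun first-value-under (header table)
  "Text of the first data cell in the column whose <th> text equals HEADER.
Returns NIL when the header or a data row cannot be found."
  (let* ((rows (collect-tags "tr" table))
         (header-row (find-if (lambda (row) (cells "th" row)) rows))
         (index (and header-row
                     (position header (cells "th" header-row)
                               :key #'cell-text :test #'string=)))
         (data-row (find-if (lambda (row) (cells "td" row)) rows)))
    (when (and index data-row)
      (let ((cell (nth index (cells "td" data-row))))
        (and cell (cell-text cell))))))

(first-value-under "Name" *people-table*)   ; => "First Person"
(first-value-under "Weight" *people-table*) ; => NIL
```

On a real page you would pass the table element found in the parsed document instead of *people-table*. Rows that mix th and td cells would need extra handling.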
