Scraping an HTML table in Common Lisp?
Question
I'd like to extract some information from a web page that's contained in an HTML <table>. How can I extract all the table information into a nice | separated file?
Author|Book|Year|Comments
Bill Bryson|Short History of Nearly Everything|2004
Stephen Hawking|A Brief History of Time|1998|Still haven't read.
Ideally, I'd like to have a function that takes a URL and output file as parameters then gives the above output.
(defun extract-table (url filename)
  (extract-from-html-table (fetch-web-page url)))

(extract-table "http://www.mypage.com" "output.txt")
Sample HTML input for the above output:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Lisp</title>
</head>
<body>
<h1>Welcome to Lisp</h1>
<table class="any" style="font-size: 14px;">
<TR class="header">
<td>Author</td>
<TD>Book</TD>
<td>Year</td>
<td>Comments</td>
</TR>
<tr class="odd">
<td>Bill Bryson</td>
<td>Short History of Nearly Everything</td>
<td>2004</td>
</tr>
<tr>
<td>Stephen Hawking</td>
<td>A Brief History of Time</td>
<td>1998</td>
<td>Still haven't read.</td>
</tr>
</table>
</body>
</html>
Recommended answer
Start with Drakma for fetching the data. To parse the thing, you might find cxml helpful. Or better yet: you could use closure-html, which should parse arbitrary HTML 4. The Common-Lisp.net page of the closure-html package has a screen scraping example.
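As a rough illustration of that suggestion, here is a minimal sketch that fetches the page with Drakma, parses it with closure-html into LHTML (nested lists of the form (tag attributes . children)), and writes one |-separated line per table row. It assumes Quicklisp is available to load both libraries. The helper names (node-text, find-elements, row-cells, table-rows, write-rows) are made up for this example and are not part of either library. The walker searches for TR elements at any depth, so it does not depend on whether the parser wraps rows in a TBODY element.

```lisp
;; Assumption: Quicklisp is installed and can load both libraries.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:drakma :closure-html) :silent t))

(defun node-text (node)
  "Concatenate all text found under the LHTML NODE."
  (cond ((stringp node) node)
        ((consp node)
         ;; An LHTML element is (tag attributes . children).
         (format nil "~{~A~}" (mapcar #'node-text (cddr node))))
        (t "")))

(defun find-elements (node name)
  "Return every element tagged NAME under NODE, in document order.
Does not descend into a matching element, so rows of nested tables are skipped."
  (when (consp node)
    (if (eq (first node) name)
        (list node)
        (mapcan (lambda (child) (find-elements child name))
                (cddr node)))))

(defun row-cells (row)
  "Return the trimmed text of each TD or TH child of the TR element ROW."
  (loop for child in (cddr row)
        when (and (consp child) (member (first child) '(:td :th)))
          collect (string-trim '(#\Space #\Tab #\Newline #\Return)
                               (node-text child))))

(defun table-rows (lhtml)
  "Return a list of rows, each a list of cell strings."
  (mapcar #'row-cells (find-elements lhtml :tr)))

(defun write-rows (rows stream)
  "Write each row to STREAM as one line with cells separated by |."
  (dolist (row rows)
    (format stream "~{~A~^|~}~%" row)))

(defun extract-table (url filename)
  "Fetch URL, parse it as HTML, and write all table rows to FILENAME."
  (let* ((html (drakma:http-request url))
         (lhtml (chtml:parse html (chtml:make-lhtml-builder))))
    (with-open-file (out filename :direction :output
                                  :if-exists :supersede)
      (write-rows (table-rows lhtml) out))))
```

With this in place, (extract-table "http://www.mypage.com" "output.txt") would match the call shape from the question. Note that the sketch collects rows from every table on the page; restricting it to one table would mean first selecting that TABLE element and passing it to table-rows.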