Rescraping data with Enlive

Question
I tried to create a function to scrape <h3> and <table> tags from an HTML page whose URL I provide to the function, and this works as it should. I get a sequence of <h3> and <table> elements, but when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags I get (nil nil nil ...). Could you please help me resolve this issue, or explain what I am doing wrong?

Here is the code:

```clojure
(ns Test2
  (:require [net.cgrand.enlive-html :as html])
  (:require [clojure.string :as string]))

(defn get-page
  "Gets the HTML page from the passed URL"
  [url]
  (html/html-resource (java.net.URL. url)))

(defn h3+table
  "Returns a sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
```
This line gives me a headache:

```clojure
(html/select (h3+table url) [:table])
```

Could you please tell me what I am doing wrong? Just to clarify my question: is it possible to use Enlive's select function to extract only the table tags from the result of (h3+table url)?
Solution

As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw HTML. It seems like you try to apply html/select multiple times, but this doesn't work: html/select parses HTML into a Clojure data structure, so you can't apply it to that data structure again.

I found that parsing the website was actually a little involved, but I thought that it might be a nice use case for multimethods, so I hacked something together; maybe this will get you started (the code is ugly here, you can also check out this gist):
```clojure
(ns tutorial.scrape1
  (:require [net.cgrand.enlive-html :as html]))

(def *url* "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

(defn get-page [url]
  (html/html-resource (java.net.URL. url)))

(defn content->string [content]
  (cond
    (nil? content)    ""
    (string? content) content
    (map? content)    (content->string (:content content))
    (coll? content)   (apply str (map content->string content))
    :else             (str content)))

(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap ::Map)
(derive java.lang.String ::String)
(derive clojure.lang.ISeq ::Collection)
(derive clojure.lang.PersistentList ::Collection)
(derive clojure.lang.LazySeq ::Collection)

(defn tag-type [node]
  (case (:tag node)
    :tr    ::CompoundNode
    :table ::CompoundNode
    :th    ::TerminalNode
    :td    ::TerminalNode
    :h3    ::TerminalNode
    :tbody ::IgnoreNode
    ::IgnoreNode))

(defmulti parse-node
  (fn [node]
    (let [cls (class node)]
      [cls (if (isa? cls ::Map) (tag-type node) nil)])))

(defmethod parse-node [::Map ::TerminalNode] [node]
  (content->string (:content node)))

(defmethod parse-node [::Map ::CompoundNode] [node]
  (map parse-node (:content node)))

(defmethod parse-node [::Map ::IgnoreNode] [node]
  (parse-node (:content node)))

(defmethod parse-node [::String nil] [node]
  node)

(defmethod parse-node [::Collection nil] [node]
  (map parse-node node))

(defn h3+table [url]
  (let [ws-content (get-page url)
        h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
                                             [:div#prospekt_container :table]})]
    (for [node h3s+tables] (parse-node node))))
```
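As an aside on the original question: the nodes that html/select returns are ordinary Clojure maps with a :tag key, so one way to get only the tables out of a combined result, without calling select a second time, is plain sequence filtering. A minimal sketch, where the node literals are hand-built stand-ins for what Enlive would return, not real output from the site above:

```clojure
;; Enlive nodes are plain maps like {:tag :table :attrs nil :content [...]},
;; so ordinary sequence functions work on the result of html/select.
(def nodes [{:tag :h3 :content ["Title"]}
            {:tag :table :content []}
            {:tag :h3 :content ["Other"]}])

;; Keep only the table nodes -- no second html/select needed.
(filter #(= :table (:tag %)) nodes)
;; => ({:tag :table, :content []})
```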
A few words on what's going on:

content->string takes a data structure, collects its content into a string, and returns that, so you can apply it to content that may still contain nested subtags (like <br/>) that you want to ignore.

The derive statements establish an ad hoc hierarchy which we will later use in the multimethod parse-node. This is handy because we never quite know which data structures we're going to encounter, and we can easily add more cases later on.
The tag-type function is actually a hack that mimics the derive statements: AFAIK you can't create a hierarchy out of non-namespace-qualified keywords, so I did it like this.

The multimethod parse-node dispatches on the class of the node and, if the node is a map, additionally on its tag-type.

Now all we have to do is define the appropriate methods: if we're at a terminal node we convert the contents to a string; otherwise we either recur on the content or map the parse-node function over the collection we're dealing with. The method for ::String is actually not even used, but I left it in for safety.

The h3+table function is pretty much what you had before; I simplified the selectors a bit and put them into a set. I'm not sure whether putting them into a map, as you did, works as intended.

Happy scraping!
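As a quick sanity check of the dispatch logic, feeding parse-node a hand-built table node (again a stand-in literal, not real site data) walks the ::CompoundNode method for the table and row and the ::TerminalNode method for the cells, yielding nested strings:

```clojure
(parse-node {:tag :table
             :content [{:tag :tr
                        :content [{:tag :td :content ["Symbol"]}
                                  {:tag :td :content ["VZAS"]}]}]})
;; => (("Symbol" "VZAS"))
```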