Rescraping data with Enlive

Question
I tried to create a function to scrape <h3> and <table> tags from an HTML page whose URL I provide to the function, and this works as it should. I get a sequence of <h3> and <table> elements, but when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags I get (nil nil nil ...). Could you please help me resolve this issue, or explain what I am doing wrong?

Here is the code:

```clojure
(ns Test2
  (:require [net.cgrand.enlive-html :as html])
  (:require [clojure.string :as string]))

(defn get-page
  "Gets the HTML page from the passed URL"
  [url]
  (html/html-resource (java.net.URL. url)))

(defn h3+table
  "Returns a sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
```
This line gives me a headache:

```clojure
(html/select (h3+table url) [:table])
```

Could you please tell me what I am doing wrong? Just to clarify my question: is it possible to use Enlive's select function to extract only the table tags from the result of (h3+table url)?
Solution

As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw HTML. It seems like you try to apply html/select multiple times, but this doesn't work: html/select parses HTML into a Clojure data structure, so you can't apply it to that data structure again.

I found that parsing the website was actually a little involved, but I thought that it might be a nice use case for multimethods, so I hacked something together; maybe this will get you started (the code is ugly here, you can also check out this gist):
```clojure
(ns tutorial.scrape1
  (:require [net.cgrand.enlive-html :as html]))

(def *url* "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

(defn get-page [url]
  (html/html-resource (java.net.URL. url)))

(defn content->string [content]
  (cond
    (nil? content)    ""
    (string? content) content
    (map? content)    (content->string (:content content))
    (coll? content)   (apply str (map content->string content))
    :else             (str content)))

(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap ::Map)
(derive java.lang.String ::String)
(derive clojure.lang.ISeq ::Collection)
(derive clojure.lang.PersistentList ::Collection)
(derive clojure.lang.LazySeq ::Collection)

(defn tag-type [node]
  (case (:tag node)
    :tr    ::CompoundNode
    :table ::CompoundNode
    :th    ::TerminalNode
    :td    ::TerminalNode
    :h3    ::TerminalNode
    :tbody ::IgnoreNode
    ::IgnoreNode))

(defmulti parse-node
  (fn [node]
    (let [cls (class node)]
      [cls (if (isa? cls ::Map) (tag-type node) nil)])))

(defmethod parse-node [::Map ::TerminalNode] [node]
  (content->string (:content node)))

(defmethod parse-node [::Map ::CompoundNode] [node]
  (map parse-node (:content node)))

(defmethod parse-node [::Map ::IgnoreNode] [node]
  (parse-node (:content node)))

(defmethod parse-node [::String nil] [node]
  node)

(defmethod parse-node [::Collection nil] [node]
  (map parse-node node))

(defn h3+table [url]
  (let [ws-content (get-page url)
        h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
                                             [:div#prospekt_container :table]})]
    (for [node h3s+tables] (parse-node node))))
```
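As an aside on the original question: the nodes that html/select returns are ordinary Clojure maps with a :tag key, so one way to get only the tables out of a combined result, without calling select a second time, is plain sequence filtering. A minimal sketch, where the node literals are hand-built stand-ins for what Enlive would return, not real output from the site above:

```clojure
;; Enlive nodes are plain maps like {:tag :table :attrs nil :content [...]},
;; so ordinary sequence functions work on the result of html/select.
(def nodes [{:tag :h3 :content ["Title"]}
            {:tag :table :content []}
            {:tag :h3 :content ["Other"]}])

;; Keep only the table nodes -- no second html/select needed.
(filter #(= :table (:tag %)) nodes)
;; => ({:tag :table, :content []})
```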
A few words on what's going on:

content->string takes a data structure, collects its content into a string, and returns that, so you can apply it to content that may still contain nested subtags (like <br/>) that you want to ignore.

The derive statements establish an ad hoc hierarchy which we will later use in the multimethod parse-node. This is handy because we never quite know which data structures we're going to encounter, and we can easily add more cases later on.
The tag-type function is actually a hack that mimics the derive statements: AFAIK you can't create a hierarchy out of non-namespace-qualified keywords, so I did it like this.

The multimethod parse-node dispatches on the class of the node and, if the node is a map, additionally on its tag-type.

Now all we have to do is define the appropriate methods: if we're at a terminal node we convert the contents to a string; otherwise we either recur on the content or map the parse-node function over the collection we're dealing with. The method for ::String is actually not even used, but I left it in for safety.

The h3+table function is pretty much what you had before; I simplified the selectors a bit and put them into a set. I'm not sure whether putting them into a map, as you did, works as intended.

Happy scraping!
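As a quick sanity check of the dispatch logic, feeding parse-node a hand-built table node (again a stand-in literal, not real site data) walks the ::CompoundNode method for the table and row and the ::TerminalNode method for the cells, yielding nested strings:

```clojure
(parse-node {:tag :table
             :content [{:tag :tr
                        :content [{:tag :td :content ["Symbol"]}
                                  {:tag :td :content ["VZAS"]}]}]})
;; => (("Symbol" "VZAS"))
```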