Parsing HTML in Haskell

Question

I'm trying to parse the <a> links from the main part (<article>) of a blog post. I have adapted what I found on FPComplete, but nothing is printed out. (As far as I can see the original code does not work either: running it in the online IDE with its Bing target also produces no links.)

In GHCi I can simulate the first line of parseAF, and that gets me a large record, which I take to be correct. But cursor $// findNodes &| extractData returns [].
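
A rough reconstruction of that session (a hypothetical transcript, assuming the module above is loaded into GHCi):

λ> cursor <- cursorFor url
λ> cursor $// findNodes &| extractData
[]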

I've tried regex, but it wasn't happy trying to match such a long piece of text.

Can anyone help?

{-# LANGUAGE OverloadedStrings #-}

module HtmlParser where

import Network.HTTP.Conduit (simpleHttp)
import Prelude hiding (concat, putStrLn)
import Data.Text (concat)
import Data.Text.IO (putStrLn)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&//), (&/), (&|))

-- The URL we're going to search
url = "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "article" &/ element "a"

-- Extract the data from each node in turn
extractData = concat . attribute "href"

cursorFor :: String -> IO Cursor
cursorFor u = do
     page <- simpleHttp u
     return $ fromDocument $ parseLBS page

-- Process the list of data elements
processData = mapM_ putStrLn

-- main = do
parseAF :: IO ()
parseAF = do
     cursor <- cursorFor url
     processData $ cursor $// findNodes &| extractData

UPDATE: After more exploring, it seems that the problem lies with element "article". If I replace that with element "p" (which is OK in this instance, as the only <p>s are inside the article anyway), then I get my links. Pretty weird!
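
One plausible explanation for the UPDATE, which would also account for element "p" working: the &/ axis only matches immediate children, and on this page the <a> tags sit inside <p> elements rather than directly under <article>. A sketch of the fix under that assumption, using the descendant axis &// that is already in the import list:

-- Sketch of the likely fix: use the descendant axis (&//)
-- instead of the child axis (&/)
findNodes :: Cursor -> [Cursor]
findNodes = element "article" &// element "a"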

Answer

I think you can do this in a very readable way with HXT by composing filters:

{-# LANGUAGE Arrows #-}

import Text.XML.HXT.Core
import Text.XML.HXT.Curl
import Text.XML.HXT.TagSoup

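-- Fetch the URL with curl and parse the response leniently as HTML,
-- using tagsoup as the lexer and suppressing parser warnings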
links url = extract (readDocument
  [ withParseHTML yes
  , withTagSoup
  , withCurl      []
  , withWarnings  no
  ] url)

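-- Run the arrow: select <a> elements occurring anywhere inside an
-- <article> element and extract each one's href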
extract doc = runX $ doc >>> xmlFilter "article" >>> xmlFilter "a" >>> toHref

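-- Select all elements with the given name, at any depth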
xmlFilter name = deep (hasName name)

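-- Read off the href attribute of an element (arrow notation)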
toHref = proc el -> do
   link    <- getAttrValue "href" -< el
   returnA -< link

您可以通过以下方式调用它:

You can call this in the following way:

links "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

