Group html table rows with HXT in Haskell


Problem description

I want to process a (very poorly defined) HTML document in which the information is grouped in pairs of rows, like this:

<html>
<body>
<table>
 <tr>
     <td>
         <font >
         <a href="a">ABC</a></font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-1-2013</font>
          <b><font>&nbsp; </font></b>
         <font>Where:</font><font>Here</font>
         <font>Who:</font><font>Me</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
             <a href="b">EFG</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-2-2013</font>
         <b><font>&nbsp; </font></b>
         <font>Where:</font><font>There</font>
         <font>Who:</font><font>You</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
            <a href="c">HIJ</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-3-2013</font><b>
         <font>&nbsp; </font></b>
         <font>Where:</font><font>Far away</font>
         <font>Who:</font><font>Him</font>
     </td>
 </tr>
</table>
</body>
</html>

To do this, after several iterations, I arrived at the following code, which achieves what I want:

import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 [] = []
group2 (x0:x1:xs) = [x0,x1]:(group2 xs)

countRows html = html >>> deep (hasName "tr") >. length

parsePage sz html = let
  n x = deep (hasName "tr") >. ((\a -> a !! x) . group2) >>> unlistA
  m = deep (hasName "td") >>> css "a" /> getText
  o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
  p x = (((n x) >>> m) &&& ((n x) >>> o))
  in html >>> catA [p x | x <- [0..sz]]

main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    count <- (runX . countRows) html
    let cnt = ((head count) `div` 2) - 1
    prcssd <- (runX . (parsePage cnt)) html
    print prcssd

And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]

However, I don't think this is a very good approach, since it has to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.

The question "extract multiples html tables with hxt", while useful, presents a simpler situation, I believe.

Solution

Here's a somewhat simpler implementation.

import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 :: [a] -> [(a, a)]
group2 [] = []
group2 (x0:x1:xs) = (x0, x1) : group2 xs

parsePage :: ArrowXml a => a XmlTree (String, String)
parsePage = let
    trPairs    = deep (hasName "tr") >>. group2
    insideLink = deep (hasName "a") /> getText
    insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText

    in trPairs >>> (insideLink *** insideFont)


main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    prcssd <- runX $ html >>> parsePage
    print prcssd

The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.
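To make the difference between the two operators concrete, here is a minimal sketch that models a list arrow as a plain function returning a list of results. This is a simplified model, not HXT's own code: the names `ListFn`, `collapse`, `transform`, `unlist` and `andThen` are hypothetical stand-ins for the arrow type, `>.`, `>>.`, `unlistA` and `>>>`.

```haskell
-- Simplified model of a list arrow: one input, a list of results.
-- All names in this block are hypothetical, chosen for illustration.
type ListFn b c = b -> [c]

-- Models (>.): the whole result list is collapsed into ONE result.
collapse :: ListFn b c -> ([c] -> d) -> ListFn b d
collapse f g = \x -> [g (f x)]

-- Models (>>.): the result list is rewritten, and the elements of the
-- new list are the results of the arrow.
transform :: ListFn b c -> ([c] -> [d]) -> ListFn b d
transform f g = \x -> g (f x)

-- Models unlistA: every element of an input list becomes a result.
unlist :: ListFn [c] c
unlist = id

-- Models (>>>): run the second arrow on every result of the first.
andThen :: ListFn a b -> ListFn b c -> ListFn a c
andThen f g = \x -> concatMap g (f x)

-- A stand-in for "deep (hasName \"tr\")": four results per input.
rows :: ListFn () Int
rows = const [1, 2, 3, 4]

-- With the (>.) model, the single list result must be unpacked again.
viaCollapse :: ListFn () Int
viaCollapse = (rows `collapse` reverse) `andThen` unlist

-- With the (>>.) model, no unpacking step is needed.
viaTransform :: ListFn () Int
viaTransform = rows `transform` reverse
```

In this model both `viaCollapse ()` and `viaTransform ()` produce the same list, which is why the second form can drop the extra step.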

I changed the group2 function to return a list of pairs, because that maps better onto what we are trying to achieve and is easier to work with.
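One caveat: group2 as written has no case for a list with an odd number of elements, so a stray unpaired row would end in a pattern-match failure. A possible total variant is sketched below; the name `pairUp` is hypothetical, and silently dropping the leftover element is an assumption about the desired behaviour, not something the original code specifies.

```haskell
-- Pair up consecutive elements. A trailing element without a partner
-- is dropped (an assumed policy) instead of causing a runtime error.
pairUp :: [a] -> [(a, a)]
pairUp (x0:x1:xs) = (x0, x1) : pairUp xs
pairUp _          = []
```

Because it has the same type as group2, it could be used in its place in trPairs if malformed tables are a concern.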

The type of trPairs is

trPairs :: ArrowXml a => a XmlTree (XmlTree, XmlTree)

That is, it's an arrow that takes in nodes and outputs pairs of nodes (the paired-up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a separate transformation to each element of the pair: insideLink to the first one and insideFont to the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.
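Since the question mentions trying &&&, it may help to see how the two combinators differ. The sketch below uses ordinary functions, which are also arrows, rather than HXT arrows; the names `labelPair` and `lengthAndUpper` are made up for illustration.

```haskell
import Control.Arrow ((***), (&&&))
import Data.Char (toUpper)

-- (***) takes a pair and applies one function to each component,
-- which is the shape needed once the rows are already paired up.
labelPair :: (String, Int) -> (String, Int)
labelPair = map toUpper *** (+ 1)

-- (&&&) takes a single input and feeds it to both functions,
-- building a pair from the two results.
lengthAndUpper :: String -> (Int, String)
lengthAndUpper = length &&& map toUpper
```

Here `labelPair ("abc", 1)` gives `("ABC", 2)` and `lengthAndUpper "abc"` gives `(3, "ABC")`. With &&& both sides see the same input, which is why the original code had to select the row pair by index before applying it.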
