Group html table rows with HXT in Haskell

Views: 112 | Posted: 2018/6/4 17:22:19 | Tags: haskell, html-table, hxt


Problem description


I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:
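
    <html>
      <body>
        <table>
          <tr>
            <td>
              <font><a href="a">ABC</a></font>
            </td>
          </tr>
          <tr>
            <td height="50">
              <font>When:</font><font>19-1-2013</font>
              <b><font>&nbsp;</font></b>
              <font>Where:</font><font>Here</font>
              <font>Who:</font><font>Me</font>
            </td>
          </tr>
          <tr>
            <td>
              <font><a href="b">EFG</a></font>
            </td>
          </tr>
          <tr>
            <td height="50">
              <font>When:</font><font>19-2-2013</font>
              <b><font>&nbsp;</font></b>
              <font>Where:</font><font>There</font>
              <font>Who:</font><font>You</font>
            </td>
          </tr>
          <tr>
            <td>
              <font><a href="c">HIJ</a></font>
            </td>
          </tr>
          <tr>
            <td height="50">
              <font>When:</font><font>19-3-2013</font>
              <b><font>&nbsp;</font></b>
              <font>Where:</font><font>Far away</font>
              <font>Who:</font><font>Him</font>
            </td>
          </tr>
        </table>
      </body>
    </html>
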
For this, after several iterations, I arrived at the following code, which achieves what I want:

    import Data.List
    import Control.Arrow.ArrowNavigatableTree
    import Text.XML.HXT.Core
    import Text.HandsomeSoup

    -- Split a list into consecutive two-element sublists.
    group2 [] = []
    group2 (x0:x1:xs) = [x0,x1]:(group2 xs)

    -- Count all <tr> elements in the document.
    countRows html = html >>> deep (hasName "tr") >. length

    parsePage sz html =
      let -- n x: the rows belonging to the x-th pair
          n x = deep (hasName "tr") >. ((\a -> a !! x) . group2) >>> unlistA
          -- m: the link text from the first row of a pair
          m = deep (hasName "td") >>> css "a" /> getText
          -- o: the text of the fifth <font> in the second row of a pair
          o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
          p x = (((n x) >>> m) &&& ((n x) >>> o))
      in html >>> catA [p x | x <- [0..sz]]

    main = do
      dt <- readFile "test.html"
      let html = parseHtml dt
      count <- (runX . countRows) html
      let cnt = ((head count) `div` 2) - 1
      prcssd <- (runX . (parsePage cnt)) html
      print prcssd

And the result is:

    [("ABC","Here"),("EFG","There"),("HIJ","Far away")]

However, I don't think this is a very good approach, since it has to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.

The question "extract multiples html tables with hxt", while useful, presents a simpler situation, I believe.

Solution

Here's a somewhat simpler implementation.

    import Text.XML.HXT.Core
    import Text.HandsomeSoup

    group2 :: [a] -> [(a, a)]
    group2 [] = []
    group2 (x0:x1:xs) = (x0, x1) : group2 xs

    parsePage :: ArrowXml a => a XmlTree (String, String)
    parsePage =
      let trPairs    = deep (hasName "tr") >>. group2
          insideLink = deep (hasName "a") /> getText
          insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText
      in trPairs >>> (insideLink *** insideFont)

    main = do
      dt <- readFile "test.html"
      let html = parseHtml dt
      prcssd <- runX $ html >>> parsePage
      print prcssd

The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.
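
To make the difference between the two operators concrete, here is a minimal, dependency-free sketch. It models a list arrow as a plain function from an input to a list of results and re-defines >>., >. and unlistA locally for illustration. These definitions are assumptions that mirror how HXT's pure list arrows are understood to behave, not HXT's actual source, and ListArrow, andThen and rows are hypothetical names.

```haskell
-- Simplified model: a list arrow is a function returning a list of results.
-- Nothing here is imported from HXT; the operators are re-defined locally.
type ListArrow b c = b -> [c]

infixl 5 >>., >.

-- Transform the whole result list; the results stay separate outputs.
(>>.) :: ListArrow b c -> ([c] -> [d]) -> ListArrow b d
f >>. g = g . f

-- Collapse the whole result list into one single output value.
(>.) :: ListArrow b c -> ([c] -> d) -> ListArrow b d
f >. g = \x -> [g (f x)]

-- Turn one list-valued output back into separate outputs.
unlistA :: ListArrow [c] c
unlistA = id

-- Left-to-right composition of list arrows (stands in for >>>).
andThen :: ListArrow a b -> ListArrow b c -> ListArrow a c
andThen f g = concatMap g . f

-- A toy arrow that yields three "rows" for any input.
rows :: ListArrow () String
rows = const ["r1", "r2", "r3"]

main :: IO ()
main = do
  print ((rows >>. reverse) ())                     -- ["r3","r2","r1"]
  print (((rows >. reverse) `andThen` unlistA) ())  -- ["r3","r2","r1"]
  print ((rows >. length) ())                       -- [3]
```

In this model, f >>. g gives the same results as f >. g followed by unlistA, which is why the extra call can be dropped. The >. form is still the natural choice when a single summary value is wanted, as with length in countRows.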

I changed the group2 function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.
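
One caveat: group2 as written has no case for a single leftover element, so a table with an odd number of rows would hit a pattern-match failure. A possible total variant is sketched below. The name pairUp is hypothetical, and silently dropping the unpaired trailing element is a choice made for this sketch, not something the answer specifies.

```haskell
-- Total variant of group2: pairs up consecutive elements and drops an
-- unpaired trailing element instead of failing on it.
pairUp :: [a] -> [(a, a)]
pairUp (x0 : x1 : xs) = (x0, x1) : pairUp xs
pairUp _              = []

main :: IO ()
main = do
  print (pairUp "abcdef")                 -- [('a','b'),('c','d'),('e','f')]
  print (pairUp [1, 2, 3, 4, 5 :: Int])   -- [(1,2),(3,4)]
```

Because it has the same type as group2, it could be passed to >>. in its place.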

The type of trPairs is

    trPairs :: ArrowXml a => a XmlTree (XmlTree, XmlTree)

i.e. it's an arrow that takes in a node and outputs pairs of nodes (the paired-up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to each element of the pair: insideLink to the first one and insideFont to the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.
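
The behaviour of *** on arrows that can return several results is worth seeing in isolation. The sketch below uses Kleisli [] from base as a stand-in for an HXT list arrow, which is an assumption made so the example runs without HXT. The names linkText, fifthFont and extractPair are hypothetical, and a row pair is modelled as a link text plus the list of <font> texts.

```haskell
import Control.Arrow (Kleisli (..), (***))

-- Stand-in for insideLink: passes the link text through as one result.
linkText :: Kleisli [] String String
linkText = Kleisli (\t -> [t])

-- Stand-in for insideFont: keeps only the fifth font text, if present.
fifthFont :: Kleisli [] [String] String
fifthFont = Kleisli (take 1 . drop 4)

-- Apply one arrow to each half of the pair and combine the results.
extractPair :: (String, [String]) -> [(String, String)]
extractPair = runKleisli (linkText *** fifthFont)

main :: IO ()
main = do
  let fonts = ["When:", "19-1-2013", " ", "Where:", "Here", "Who:", "Me"]
  print (extractPair ("ABC", fonts))       -- [("ABC","Here")]
  print (extractPair ("EFG", ["When:"]))   -- []
```

Note the second case: when one side produces no result, the whole pair disappears from the output. HXT's list arrows appear to combine results the same way, so a row pair with fewer than five <font> elements would most likely be dropped from the final list rather than reported as an error.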
