Group html table rows with HXT in Haskell
I want to process a (very poorly defined) HTML document in which the information is grouped in pairs of rows, like this:
<html>
<body>
<table>
<tr>
<td>
<font >
<a href="a">ABC</a></font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-1-2013</font>
<b><font> </font></b>
<font>Where:</font><font>Here</font>
<font>Who:</font><font>Me</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="b">EFG</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-2-2013</font>
<b><font> </font></b>
<font>Where:</font><font>There</font>
<font>Who:</font><font>You</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="c">HIJ</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-3-2013</font><b>
<font> </font></b>
<font>Where:</font><font>Far away</font>
<font>Who:</font><font>Him</font>
</td>
</tr>
</table>
</body>
</html>
After several iterations, I arrived at this code, which achieves what I want:
import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup
-- Split a list into consecutive two-element chunks.
group2 [] = []
group2 (x0:x1:xs) = [x0,x1] : group2 xs
-- Count all <tr> elements in the document.
countRows html = html >>> deep (hasName "tr") >. length
parsePage sz html = let
    -- n x: the x-th pair of rows
    n x = deep (hasName "tr") >. ((\a -> a !! x) . group2) >>> unlistA
    -- m: link text from the first row of a pair
    m = deep (hasName "td") >>> css "a" /> getText
    -- o: text of the fifth <font> in the second row of a pair
    o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
    p x = (((n x) >>> m) &&& ((n x) >>> o))
  in html >>> catA [p x | x <- [0..sz]]
main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    count <- (runX . countRows) html
    let cnt = ((head count) `div` 2) - 1
    prcssd <- (runX . (parsePage cnt)) html
    print prcssd
And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
However, I don't think this is a very good approach, since it has to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.
The question "extract multiples html tables with hxt", while useful, presents a simpler situation, I believe.
Here's a somewhat simpler implementation.
import Text.XML.HXT.Core
import Text.HandsomeSoup
group2 :: [a] -> [(a, a)]
group2 [] = []
group2 (x0:x1:xs) = (x0, x1) : group2 xs
parsePage :: ArrowXml a => a XmlTree (String, String)
parsePage = let
    trPairs    = deep (hasName "tr") >>. group2
    insideLink = deep (hasName "a") /> getText
    insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText
  in trPairs >>> (insideLink *** insideFont)
main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    prcssd <- runX $ html >>> parsePage
    print prcssd
The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.
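To make the difference concrete, here is a small sketch that runs both operators on a plain list arrow instead of an HTML document. The `rows` arrow and its integer contents are made up for illustration; `LA`, `runLA`, `constL` and `unlistA` are assumed to be available through Text.XML.HXT.Core, as in recent hxt releases.

```haskell
import Text.XML.HXT.Core

-- A six-element result stream standing in for the <tr> nodes
-- (illustrative data, not taken from the question).
rows :: LA () Int
rows = constL [1 .. 6]

main :: IO ()
main = do
  -- (>.) collapses all results into ONE value, here a single list.
  print (runLA (rows >. take 2) ())              -- [[1,2]]
  -- unlistA turns that single list back into separate results.
  print (runLA (rows >. take 2 >>> unlistA) ())  -- [1,2]
  -- (>>.) rewrites the result list directly, so no unlistA is needed.
  print (runLA (rows >>. take 2) ())             -- [1,2]
```

In other words, `f >>. g` behaves like `f >. g >>> unlistA` whenever `g` returns a list.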
I changed the group2 function to return a list of pairs, because it maps better to what we are trying to achieve and it's easier to work with.
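One caveat: group2 as written has no case for a single leftover element, so a table with an odd number of rows would fail with a pattern-match error. A minimal sketch of a total variant follows, assuming that a trailing unpaired row should simply be dropped (the original does not say what should happen in that case).

```haskell
-- Total variant of group2: a trailing unpaired element is dropped.
-- Dropping it is an assumption; the original leaves this case undefined.
group2 :: [a] -> [(a, a)]
group2 (x0 : x1 : xs) = (x0, x1) : group2 xs
group2 _              = []

main :: IO ()
main = do
  print (group2 "abcdef")          -- [('a','b'),('c','d'),('e','f')]
  print (group2 [1, 2, 3 :: Int])  -- [(1,2)]
```

It has the same type as the original, so it could be swapped into parsePage without other changes.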
The type of trPairs is
trPairs :: ArrowXml a => a XmlTree (XmlTree, XmlTree)
i.e. it's an arrow that takes in nodes and outputs pairs of nodes (the paired-up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to each element of the pair: insideLink for the first one and insideFont for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.
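Since ordinary functions are arrows too, the behaviour of *** can be tried out without HXT. The sketch below uses a hypothetical helper, `labelPair`, that applies one function to each component of a pair, in the same way that insideLink and insideFont are each applied to one <tr> of a pair.

```haskell
import Control.Arrow ((***))
import Data.Char (toUpper)

-- (***) on plain functions: the left function sees the first component,
-- the right function sees the second. labelPair is illustrative only.
labelPair :: (String, String) -> (Int, String)
labelPair = length *** map toUpper

main :: IO ()
main = print (labelPair ("ABC", "Here"))  -- (3,"HERE")
```

Compare this with &&&, used in the question, which feeds the same input to both arrows rather than splitting a pair between them.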