Extract multiple HTML tables with HXT

Problem description

My problem is that I have to extract all of the tables from an HTML document and put them in a list of tables.

Hence, I understand that the type of the final function should be

getTable :: a [XmlTree] [[String]]

for example with the following xml:

<table class="t1">
<tr>
    <td>x</td>
    <td>y</td>
</tr>
<tr>
    <td>a</td>
    <td>b</td>
</tr>
</table>
<table class="t2">
<tr>
    <td>3</td>
    <td>5</td>
</tr>
<tr>
    <td>toto</td>
    <td>titi</td>
</tr>
</table>

I know how to retrieve all the rows from one XmlTree (example1), or all the "table" tags, which gives me the type [XmlTree], but I don't know how to map the arrow example1 over the result of test2.

I'm sure it's obvious, but I can't find it.

test2 ::  IO [[XmlTree]]
test2 = runX $ parseXML "table.xml" >>> is "table">>> listA getChildren

example1 ::  ArrowXml a => a XmlTree [String]
example1  = is "table" /> listA (getChildren >>> is "td"  /> getText)

Solution

Using the same general idea that you have in example1, we can write getTable like this:

getTable :: ArrowXml a => a XmlTree [[String]]
getTable =  hasName "table" >>> listA (rows >>> listA cols) where
    rows = getChildren >>> hasName "tr"
    cols = getChildren >>> hasName "td" /> getText

Running the arrow on your example document produces:

[[["x","y"],["a","b"]],[["3","5"],["toto","titi"]]]
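For reference, here is a minimal, self-contained sketch of how the arrow might be run end to end. It makes a few assumptions that are not in the original post: the sample markup is held in a string (`sampleXml`) rather than read from `table.xml`, the two tables are wrapped in a hypothetical `<root>` element so the input is a single well-formed XML document, and `deep` is used to find every `table` element anywhere in the tree. The helper name `extractTables` is illustrative only.

```haskell
import Text.XML.HXT.Core

-- Sample input from the question. The <root> wrapper is an assumption added
-- here so that the two tables form one well-formed XML document.
sampleXml :: String
sampleXml = concat
  [ "<root>"
  , "<table class=\"t1\">"
  , "<tr><td>x</td><td>y</td></tr>"
  , "<tr><td>a</td><td>b</td></tr>"
  , "</table>"
  , "<table class=\"t2\">"
  , "<tr><td>3</td><td>5</td></tr>"
  , "<tr><td>toto</td><td>titi</td></tr>"
  , "</table>"
  , "</root>"
  ]

-- The arrow from the answer: for one <table> node, collect each row
-- as a list of the text contents of its <td> cells.
getTable :: ArrowXml a => a XmlTree [[String]]
getTable = hasName "table" >>> listA (rows >>> listA cols)
  where
    rows = getChildren >>> hasName "tr"
    cols = getChildren >>> hasName "td" /> getText

-- Hypothetical helper: parse a string and apply getTable to every
-- <table> found by a top-down search, giving one result per table.
extractTables :: String -> IO [[[String]]]
extractTables src =
  runX (readString [withValidate no] src >>> deep getTable)

main :: IO ()
main = extractTables sampleXml >>= print
```

`runX` collects every result of the arrow into a list, and `listA` turns the results of an inner arrow into a single list-valued result, so the nesting of `listA` calls mirrors the nesting of tables, rows and cells. That is why no explicit mapping over `[XmlTree]` is needed.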
