Iterate through all the rows in a table using Python lxml XPath

Question

This is the source code of the HTML page I want to extract data from.

Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page)

     <html>
               <table class="clCommonGrid" cellspacing="0">
                        <thead>
                            <tr>
                                <td colspan="3">Kommande matcher</td>
                            </tr>
                            <tr>
                                <th style="width:1%;">Tid</th>
                                <th style="width:69%;">Match</th>
                                <th style="width:30%;">Arena</th>
                            </tr>
                        </thead>

                        <tbody class="clGrid">

                    <tr class="clTrOdd">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
                        <td><a href="?scr=venue&amp;faid=847">Guldheden Södra 1 Konstgräs</a> </td>
                    </tr>

                    <tr class="clTrEven">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
                        <td><a href="?scr=venue&amp;faid=941">Romevi 1 Gräs</a> </td>
                    </tr>

                    <tr class="clTrOdd">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
                    <td><a href="?scr=venue&amp;faid=912">Kode IP 1 Gräs</a> </td>
                </tr>

                <tr class="clTrEven">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
                    <td><a href="?scr=venue&amp;faid=218">Flodala IP 1</a> </td>
                </tr>


                        </tbody>
                </table>
        </html>

Right now I have this code, which actually produces the result I want:

import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" %(i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == '2014-09-27':
        print(time, teamName)

This gives the result:

13:00 Romelanda UF - IK Virgo

13:00 Kode IF - IK Kongahälla

14:00 Floda BoIF - Partille IF FK

Now to the question. I don't want to use a for loop with range because it is not stable: the rows in that table can change, and if the index goes out of bounds the code will crash. So my question is how I can iterate as I do here, but in a safe way, meaning it will iterate through all the rows that are available in the table, no more and no less. Also, if you have any other suggestions for making the code better or faster, please go ahead.

Recommended answer

The following code iterates over however many rows there are. rows_xpath filters directly on the target date. The XPath expressions are also compiled once, outside the for loop, so it should be faster.

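import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

# Compile each XPath once, outside the loop.
# rows_xpath keeps only the rows whose first cell contains the target date.
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % date)
# Relative to a row: the second text node in the first cell is the kick-off time.
time_xpath = XPath("td[1]/span/span//text()[2]")
# Relative to a row: the link text in the second cell is the fixture.
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

# Iterates over exactly the matching rows, however many there are.
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print(time, team)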
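
The same idea can be tried offline against a trimmed copy of the table from the question. This is only a sketch under stated assumptions: SAMPLE_HTML, upcoming_matches and the class-based table selector are names and choices made up here for illustration, not part of the original answer, and the live page may still need the @id='content-primary' path used above. It selects every row with one XPath, then reads each cell relative to that row, skipping rows that lack the expected cells instead of crashing.

```python
import lxml.html
from lxml.etree import XPath

# Assumption: a trimmed copy of the question's table (two rows, venue column omitted).
SAMPLE_HTML = """
<html><body>
<table class="clCommonGrid" cellspacing="0">
  <tbody class="clGrid">
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
    </tr>
  </tbody>
</table>
</body></html>
"""

# Compiled once; ROWS is absolute, the other two are relative to a row.
ROWS = XPath("//table[@class='clCommonGrid']/tbody/tr")
WHEN = XPath("td[1]/span/span/text()")   # text nodes around the comment: date, time
TEAMS = XPath("td[2]/a/text()")


def upcoming_matches(doc, date=None):
    """Yield (date, time, teams) for each row; keep only `date` if it is given."""
    for row in ROWS(doc):
        parts = [p.strip() for p in WHEN(row) if p.strip()]
        teams = TEAMS(row)
        if len(parts) < 2 or not teams:
            continue  # row does not have the expected cells, so skip it
        row_date, row_time = parts[0], parts[1]
        if date is None or row_date == date:
            yield row_date, row_time, teams[0].strip()


doc = lxml.html.fromstring(SAMPLE_HTML)

# All rows, then only the rows for one date.
for match in upcoming_matches(doc):
    print(*match)
for match in upcoming_matches(doc, date="2014-09-27"):
    print(*match)
```

Because the loop runs over whatever ROWS returns, it never indexes past the end of the table; an empty table simply yields nothing.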
