Extract information from a website using XPath and Python


Question

I'm trying to extract some useful information from a website. I've gotten part of the way, but now I'm stuck and need your help!

I need the information from this table:

http://gbgfotboll.se/serier/?scr=scorers&ftid=57700

I wrote this code and got the information I wanted:

import lxml.html
from lxml.etree import XPath

url = ("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")

rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")

league_xpath = XPath("//*[@id='content-primary']/h1//text()")


html = lxml.html.parse(url)

divName = league_xpath(html)[0]

for id,row in enumerate(rows_xpath(html)):
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print(scorername, team)


print(divName)

But I also get this error:

    scorername = name_xpath(row)[0]
IndexError: list index out of range

I do understand why I get the error. What I really need help with is that I only need the first 12 rows. This is what the extraction should do in these three possible scenarios:

If there are fewer than 12 rows: take all the rows except the last row.

If there are exactly 12 rows: same as above.

If there are more than 12 rows: simply take the first 12 rows.

How can I do this?

EDIT1

This is not a duplicate. Sure, it is the same site, but I have already done what that person wanted, which was to get all the values from each row. I don't need the last row, and I don't want to extract more than 12 rows if there are more.

Recommended answer

I think this is what you want:

# coding: utf-8
import lxml.html
from lxml import etree

collected = []  # list of rows, each a list of cell texts: [[col1, col2, ...], ...]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")
# All table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/div/table[1]/tbody/tr')
# 12 rows or fewer: take all the rows except the last one.
if len(rows) <= 12:
    rows = rows[:-1]  # slicing, unlike rows.pop(), does not fail on an empty list
else:
    # More than 12 rows: simply take the first 12.
    rows = rows[:12]

for row in rows:
    # All columns of the current table row (Spelare, Lag, Mal, straffmal)
    columns = row.findall("td")
    # Pick the text of each <td>
    collected.append([column.text for column in columns])

for i in collected:
    print(i)
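
The row-selection rule from the question can also be pulled out into a small helper and checked without fetching the page. This is only a minimal sketch: the name `select_rows` and its `limit` parameter are hypothetical, and plain Python lists stand in for the `<tr>` elements that the XPath query returns.

```python
def select_rows(rows, limit=12):
    """Return the rows to keep under the three cases from the question.

    `select_rows` and `limit` are hypothetical names used for illustration.
    """
    if len(rows) <= limit:
        # 12 rows or fewer: drop the last row. Slicing an empty list
        # returns an empty list instead of raising an error.
        return rows[:-1]
    # More than 12 rows: keep only the first 12.
    return rows[:limit]


# Plain lists stand in for the <tr> elements returned by the XPath query.
print(len(select_rows(list(range(5)))))   # 4
print(len(select_rows(list(range(12)))))  # 11
print(len(select_rows(list(range(30)))))  # 12
print(select_rows([]))                    # []
```

Because the helper only slices, it should work the same way on the list of elements that lxml returns, and it leaves the original list unmodified.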

Output:
