How to parse text from an HTML table element

Question

I'm currently writing a small test web scraper using the Python requests and lxml libraries. I'm trying to extract the text from the rows of a table on this site, using XPath to uniquely identify the table. Since the table can only be identified by its class name, and that class name isn't unique, I had to use the parent div element to specify the table. The table in question lists the season order, filming, and air dates for the show Game of Thrones, and I'm trying to select it with the following path:

tree.xpath('//div[@id = "mw-content-text"]//table[@class = "wikitable"]//text()')

For some reason, when I print the result of this expression in the shell, I get an empty list. I expected it to return all of the text in the table, which I wanted to check first to make sure I could actually get the contents; ultimately, though, I need to print each row of the table.

Is something wrong with this XPath? If so, what is the correct way to print the contents of the table?

Recommended answer

The wikitable class is too broad to distinguish one table from another on a wiki page.
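Beyond being broad, an exact comparison such as @class = "wikitable" can also return nothing when the table carries more than one class. Whether that is what happened here is an assumption, since the page markup at the time is not shown in the question. The sketch below uses a hypothetical stand-in snippet (the SAMPLE string and the extra "plainrowheaders" class are made up for illustration) to contrast the exact comparison with a token-based test:

```python
from lxml.html import fromstring

# Hypothetical stand-in markup; NOT the real page, just enough to show the effect.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <table class="wikitable plainrowheaders">
    <tr><th>Season</th><th>Premiere</th></tr>
    <tr><td>Season 1</td><td>April 17, 2011</td></tr>
  </table>
</div>
</body></html>
"""

root = fromstring(SAMPLE)

# Exact string comparison: matches only if the whole attribute value is "wikitable".
exact = root.xpath('//div[@id="mw-content-text"]//table[@class="wikitable"]')

# Token-based test: matches if "wikitable" is one of the space-separated classes.
token = root.xpath(
    '//div[@id="mw-content-text"]'
    '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
)

print(len(exact), len(token))
```

With this sample the exact comparison finds no table while the token-based test finds one. It still does not say which wikitable is the right one, which is why anchoring on something more specific is the safer choice.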

I would instead rely on the preceding Adaptation schedule label:

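import requests
from lxml.html import fromstring

# Download the page and parse it into an lxml element tree
url = "https://en.wikipedia.org/wiki/Game_of_Thrones"
response = requests.get(url)
root = fromstring(response.content)

# Find the h3 heading whose span reads "Adaptation schedule", then take the
# first table that follows it as a sibling
table = root.xpath(".//h3[span = 'Adaptation schedule']/following-sibling::table")[0]

# Skip the header row and print the text of every data cell, row by row
for row in table.xpath(".//tr")[1:]:
    print([cell.text_content() for cell in row.xpath(".//td")])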

Prints:

['Season 1', 'March 2, 2010[52]', 'Second half of 2010', 'April 17, 2011', 'June 19, 2011', 'A Game of Thrones']
['Season 2', 'April 19, 2011[53]', 'Second half of 2011', 'April 1, 2012', 'June 3, 2012', 'A Clash of Kings and some early chapters from A Storm of Swords[54]']
['Season 3', 'April 10, 2012[55]', 'Second half of 2012', 'March 31, 2013', 'June 9, 2013', 'About the first two-thirds of A Storm of Swords[56][57]']
['Season 4', 'April 2, 2013[58]', 'Second half of 2013', 'April 6, 2014', 'June 15, 2014', 'The remaining one-third of A Storm of Swords and some elements from A Feast for Crows and A Dance with Dragons[59]']
['Season 5', 'April 8, 2014[60]', 'Second half of 2014', 'April 12, 2015', 'June 14, 2015', 'A Feast for Crows, A Dance with Dragons and original content,[61] with some late chapters from A Storm of Swords[62] and elements from The Winds of Winter[63][64]']
['Season 6', 'April 8, 2014[60]', 'Second half of 2015', 'April 24, 2016', 'June 26, 2016', 'Original content and outlined from The Winds of Winter,[65][66] with some late elements from A Feast for Crows and A Dance with Dragons[67]']
['Season 7', 'April 21, 2016[50]', 'Second half of 2016[49]', 'Mid-2017[5]', 'Mid-2017[5]', 'Original content and outlined from The Winds of Winter and A Dream of Spring[66]']
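The same heading-anchored selection can be tried offline. The following is a minimal sketch on a hypothetical stand-in document (the SAMPLE string is invented for illustration and is much smaller than the real page, whose live markup may have changed since the answer was written). It adds two small guards that the answer does not have: [1] on the following-sibling step, so only the nearest table is taken, and a check for an empty result instead of indexing [0] directly.

```python
from lxml.html import fromstring

# Hypothetical stand-in markup with two wikitables; only the second one
# follows the "Adaptation schedule" heading.
SAMPLE = """
<html><body>
<h3><span>Other table</span></h3>
<table class="wikitable"><tr><th>X</th></tr><tr><td>ignored</td></tr></table>
<h3><span>Adaptation schedule</span></h3>
<table class="wikitable">
  <tr><th>Season</th><th>Premiere</th></tr>
  <tr><td>Season 1</td><td>April 17, 2011<sup>[52]</sup></td></tr>
  <tr><td>Season 2</td><td>April 1, 2012</td></tr>
</table>
</body></html>
"""

root = fromstring(SAMPLE)

# Nearest table that follows the matching heading as a sibling
tables = root.xpath(
    ".//h3[span = 'Adaptation schedule']/following-sibling::table[1]"
)

rows = []
if tables:  # avoid an IndexError when the heading is not found
    for row in tables[0].xpath(".//tr")[1:]:  # skip the header row
        rows.append([cell.text_content().strip() for cell in row.xpath(".//td")])

for cells in rows:
    print(cells)
```

Note that text_content() flattens nested elements, so footnote markers such as [52] stay attached to the cell text, just as in the output above.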
