python: xpath returns empty list from boxofficemojo.com


Problem description

I am trying to scrape specific data from each movie's page on BoxOfficeMojo.com using the code below. Unfortunately, the XPath returns an empty list. Some posts suggest removing tbody from the XPath, but that also returns an empty list. I used the same approach to pull text from Rotten Tomatoes and IMDb, and the XPath worked fine there. Does anyone know why this is happening and how it can be resolved?

from lxml import html
import requests

# Box Office Mojo scrape
page = requests.get('http://www.boxofficemojo.com/movies/?page=main&id=ateam.htm')
tree = html.fromstring(page.text)

# Both expressions (with and without tbody) return an empty list
print(tree.xpath('//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/table/tbody/tr/td/center/table/tbody/tr[1]/td/font/b/text()'))
print(tree.xpath('//*[@id="body"]/table[2]/tr/td/table[1]/tr/td[2]/table/tr/td/center/table/tr[1]/td/font/b/text()'))

# Rotten Tomatoes scrape (works as expected)
page2 = requests.get('http://www.rottentomatoes.com/m/star_wars_episode_vii_the_force_awakens/')
tree2 = html.fromstring(page2.text)

print(tree2.xpath('//*[@id="scorePanel"]/div[2]/div[1]/a/div/div[2]/div[1]/span/text()'))

# IMDb scrape (works as expected)
page3 = requests.get('http://www.imdb.com/title/tt2488496/?ref_=nv_sr_1')
tree3 = html.fromstring(page3.text)

print(tree3.xpath('//*[@id="overview-top"]/h1/span[1]/text()'))

Recommended answer

The table containing the desired information is nested inside another table, which is itself nested inside another, and so on. As a result, //*[@id='body']/table[2] matches nothing: that div has only one table as a direct child, and the other tables are nested inside it.
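The difference between direct children and nested descendants can be checked on a small example. The sketch below uses a made-up HTML snippet (the SAMPLE string is an assumption for illustration, not the real Box Office Mojo markup) to show why a positional step such as table[2] comes back empty when the second table is nested inside the first.

```python
from lxml import html

# Hypothetical markup: one outer table inside the "body" div,
# with a second table nested in one of its cells.
SAMPLE = """
<html><body>
<div id="body">
  <table>
    <tr><td>
      <table><tr><td>inner</td></tr></table>
    </td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# "/" selects direct children only, so only the outer table is counted.
direct_tables = tree.xpath('//*[@id="body"]/table')
# table[2] asks for a second direct-child table, which does not exist.
second_table = tree.xpath('//*[@id="body"]/table[2]')
# "//" selects descendants at any depth, so the nested table is included.
all_tables = tree.xpath('//*[@id="body"]//table')

print(len(direct_tables), second_table, len(all_tables))  # prints: 1 [] 2
```

Running the same three queries against the real page source is a quick way to see how many tables sit directly under the div before writing a long positional path.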

You can reach the value with this extremely unwieldy XPath expression:

//*[@id='body']/table/tr[2]/table/tr/td/table[1]/tr/td[2]/table/tr/td/center/table[1]/tr[1]/td/font/b/text()

Notice that the desired value sits in a bold (b) tag inside a font tag, and that the font tag's own text begins with Domestic Total Gross:. I would use the following to get that information:

//*[@id='body']//font[starts-with(normalize-space(.),'Domestic Total Gross:')]/b/text()

Because it matches on the label text rather than on table positions, this expression is also less fragile if the table structure changes.
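As a rough sketch of how the label-based expression could be used from lxml, the example below wraps it in a small helper. The function name extract_domestic_gross, the SAMPLE markup, and the dollar figure in it are all assumptions made for illustration; the markup only imitates the font/b nesting described above and is not copied from the site.

```python
from lxml import html

# XPath from the answer: match the font tag by its label text.
GROSS_XPATH = (
    "//*[@id='body']//font[starts-with(normalize-space(.),"
    "'Domestic Total Gross:')]/b/text()"
)


def extract_domestic_gross(page_source):
    """Return the text of the first matching <b>, or None if nothing matches."""
    tree = html.fromstring(page_source)
    matches = tree.xpath(GROSS_XPATH)
    return matches[0].strip() if matches else None


# Hypothetical markup with a placeholder figure, not real box office data.
SAMPLE = """
<html><body><div id="body">
<table><tr><td><center>
  <table><tr><td>
    <font size="4">Domestic Total Gross: <b>$1,234,567</b></font>
  </td></tr></table>
</center></td></tr></table>
</div></body></html>
"""

print(extract_domestic_gross(SAMPLE))
print(extract_domestic_gross("<html><body><div id='body'></div></body></html>"))
```

For a live page, the text returned by requests.get(...) could be passed in place of SAMPLE. Returning None instead of indexing into an empty list makes a layout change show up as a missing value rather than an IndexError.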
