Web scraping, getting empty list

Problem description

I am having a hard time figuring out the correct XPath for my web scraping code.

I am trying to scrape several pieces of information from http://financials.morningstar.com/company-profile/c.action?t=AAPL. I have tried several XPath expressions; some seem to work and some do not. I am interested in the CIK under Operation Details.

import requests
from lxml import html

page = requests.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
tree = html.fromstring(page.text)


#desc = tree.xpath('//div[@class="r_title"]/span[@class="gry"]/text()')  #works

#desc = tree.xpath('//div[@class="wrapper"]//div[@class="headerwrap"]//div[@class="h_Logo"]//div[@class="h_Logo_row1"]//div[@class="greeter"]/text()')    #works

#desc = tree.xpath('//div[@id="OAS_TopLeft"]//script[@type="text/javascript"]/text()')   #works

desc = tree.xpath('//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()')

I can't figure out the last path. It seems like I am following the page structure correctly, but I get an empty list.

Recommended answer

The problem is that the Operation Details are loaded separately with an additional GET request, so they are not part of the HTML returned for the main page. Simulate that request in your code while maintaining a web-scraping session:

import requests
from lxml import html


with requests.Session() as session:
    page = session.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
    tree = html.fromstring(page.text)

    # get the operational details
    response = session.get("http://financials.morningstar.com/company-profile/component.action", params={
        "component": "OperationDetails",
        "t": "XNAS:AAPL",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": "1444848178406"
    })

    tree_details = html.fromstring(response.content)
    # print() as a function so the snippet also runs on Python 3
    print(tree_details.xpath('.//th[@class="row_lbl"]//text()'))
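
The parameters of that extra request can also be built by a small helper, which makes it easier to reuse for other tickers. The following is only a sketch: the helper names `component_params` and `component_url` are hypothetical, the `exchange` prefix is assumed to generalize from the `XNAS:AAPL` value above, and treating `_` as a millisecond timestamp used as a cache-buster is a guess based on its value, not something the site documents.

```python
import time
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the answer above.
COMPONENT_URL = "http://financials.morningstar.com/company-profile/component.action"


def component_params(component, ticker, exchange="XNAS", now_ms=None):
    """Build the query parameters for one page component.

    Assumptions: the "t" value is always "<exchange>:<ticker>", and "_" is a
    millisecond timestamp that only serves as a cache-buster.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "component": component,
        "t": f"{exchange}:{ticker}",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": str(now_ms),
    }


def component_url(component, ticker, **kwargs):
    """Return the full URL for a component request (hypothetical helper)."""
    return COMPONENT_URL + "?" + urlencode(component_params(component, ticker, **kwargs))


# Reproduce the request from the answer, with its original timestamp.
url = component_url("OperationDetails", "AAPL", now_ms=1444848178406)
print(url)
```

With a session, the same dictionary can be passed directly, for example `session.get(COMPONENT_URL, params=component_params("OperationDetails", "AAPL"))`.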

Old answer:

It's just that you should remove tbody from the expression:

//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()

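tbody is an element that browsers insert into the DOM when a table's markup omits it, to group the table's data rows. An XPath copied from the browser's developer tools can therefore contain a tbody step that does not exist in the raw HTML downloaded by requests.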
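
The tbody effect can be reproduced offline. The sketch below uses made-up markup shaped like the table in the question (the cell value is a placeholder, not real data) and shows that lxml, unlike a browser, keeps the table as written, so the path containing tbody matches nothing.

```python
from lxml import html

# Made-up markup shaped like the table in the question; no <tbody> in the source.
RAW_HTML = """
<html><body>
<div id="OperationDetails">
  <table class="r_table1 r_txt2">
    <tr><th class="row_lbl">CIK</th><td>placeholder</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# Path as a browser's developer tools would suggest it, including tbody.
with_tbody = tree.xpath(
    '//div[@id="OperationDetails"]//table//tbody//tr//th[@class="row_lbl"]/text()'
)

# Same path with the tbody step removed.
without_tbody = tree.xpath(
    '//div[@id="OperationDetails"]//table//tr//th[@class="row_lbl"]/text()'
)

print(with_tbody)     # []
print(without_tbody)  # ['CIK']
```

If a page's source does contain an explicit tbody, both paths match, because `//tr` also finds rows nested inside tbody; leaving tbody out is therefore the safer choice.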
