lxml xpath unable to display html items

Problem description

I'm trying to use lxml to parse the webpage below, but something seems to be wrong with my XPath. I'm not sure what I'm doing wrong.

import requests
from lxml import html

web_content = requests.get(r"https://www.quandl.com/data/TSE").content
dataset_count = html.fromstring(web_content)
print(dataset_count.xpath(r'//*[@id="ember667"]/div[2]/main/section/section/section[2]/div[3]/div[2]/span[2]'))

I'm trying to get it to return the dataset count, 3908, but this XPath doesn't seem to work for me. Any thoughts?

Also, I'm hoping that if I pass another Quandl link to requests, I can use the same XPath to pull out the dataset count. Would that be possible?

Recommended answer

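The "ember667" id in your XPath looks like one that Ember.js generates in the browser, so that element is probably not present in the raw HTML that requests downloads. However, it seems the dataset count is also in a <noscript> element: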

<div class='centered' id='main' role='main'>
  <div id='content'>
    <noscript>
      <table>
        <tbody>
          <tr>
            <td>Database Name</td>
            <td>Tokyo Stock Exchange</td>
          </tr>
          <tr>
            <td></td>
            <td></td>
          </tr>
          <tr>
            <td>Datasets</td>
            <td>3908</td>
          </tr>
          <tr>
            <td>Downloads</td>
            <td>4067259</td>
          </tr>
          <tr>
          ...

So you can grab it using something like this:

>>> import requests
>>> import lxml.html

>>> r = requests.get('https://www.quandl.com/data/TSE')
>>> h = lxml.html.fromstring(r.text)
>>> h
<Element html at 0x7ffb5f6ed0a8>

>>> h.xpath('//noscript')
[<Element noscript at 0x7ffb5c16ac58>, <Element noscript at 0x7ffb5c16ac00>]

>>> h.xpath('string(//noscript//tr[td[1]="Datasets"]/td[2])')
'3908'
>>> h.xpath('string(//div[@id="content"]//noscript//tr[td[1]="Datasets"]/td[2])')
'3908'
>>> h.xpath('number(//div[@id="content"]//noscript//tr[td[1]="Datasets"]/td[2])')
3908.0
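
The same query can be tried offline against a trimmed copy of the markup above, which also hints at how it might be reused for other database pages. This is only a sketch: SAMPLE_HTML is a hand-made stand-in for the real response, and dataset_count is a hypothetical helper name, not part of lxml or requests. It assumes other database pages carry the same <noscript> table, which has not been verified here.

```python
import lxml.html

# Hand-made stand-in for the page source (assumption: the real page has
# the same structure, with more rows and surrounding markup).
SAMPLE_HTML = """
<html><body>
<div class='centered' id='main' role='main'>
<div id='content'>
<noscript>
<table>
<tbody>
<tr><td>Database Name</td><td>Tokyo Stock Exchange</td></tr>
<tr><td>Datasets</td><td>3908</td></tr>
<tr><td>Downloads</td><td>4067259</td></tr>
</tbody>
</table>
</noscript>
</div>
</div>
</body></html>
"""

# Same expression as in the session above.
DATASETS_XPATH = 'string(//div[@id="content"]//noscript//tr[td[1]="Datasets"]/td[2])'


def dataset_count(page_source):
    """Hypothetical helper: return the "Datasets" value as an int, or None."""
    tree = lxml.html.fromstring(page_source)
    # string(...) yields '' when no row matches, so guard before converting.
    text = tree.xpath(DATASETS_XPATH).strip()
    return int(text) if text.isdigit() else None


print(dataset_count(SAMPLE_HTML))  # 3908
```

For a live page, the call would presumably look like dataset_count(requests.get(url).text). Returning None instead of raising makes it easier to spot pages where the table is missing or laid out differently.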

Explanation of the XPath, as requested by the OP:

//div[@id="content"]          <-- look for a <div> element whose "id" attribute equals "content"
  //noscript                  <-- look for a <noscript> descendant
    //tr[                     <-- look for a <tr> descendant...
        td[1]="Datasets"      <-- ... whose 1st <td> child has the string value "Datasets"
                              (true when the <td>'s text content is exactly "Datasets")
        ]
      /td[2]                  <-- select the 2nd <td> child of each matching <tr> row
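
One caveat with the td[1]="Datasets" predicate: it compares the cell's full string value, so stray whitespace around the label makes the match fail. The sketch below illustrates this on a made-up table (not taken from the real page) and shows normalize-space(), a standard XPath 1.0 function, as one possible workaround.

```python
import lxml.html

# Made-up fragment: the label cell carries surrounding whitespace.
fragment = lxml.html.fromstring("""
<table>
  <tr><td> Datasets
  </td><td>3908</td></tr>
</table>
""")

# Exact comparison against the cell's string value: no row matches,
# so string() of the empty node-set is ''.
strict = fragment.xpath('string(//tr[td[1]="Datasets"]/td[2])')

# normalize-space() trims and collapses whitespace before comparing.
relaxed = fragment.xpath(
    'string(//tr[normalize-space(td[1])="Datasets"]/td[2])'
)

print(repr(strict))   # ''
print(repr(relaxed))  # '3908'
```

Whether the real page needs this depends on how its cells are written; the session above suggests the exact comparison was enough there.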
