How to set up XPath query for HTML parsing?


Question

Here is some HTML from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as shown in Google Chrome. I want to parse this page for a project.

<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds"><button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button>
    <h3 id="yui_3_18_1_3_1434394159641_407">Name of Substance</h3>
    <ul>
        <li id="ds2">
            <div>Acetaldehyde</div>
        </li>
    </ul>
</div>

I wrote a Python script to grab the name listed under one of the sections, but it isn't returning the name. I think the problem is my XPath query. Any suggestions?

from lxml import html
import requests
import csv

names1 = []

page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
tree = html.fromstring(page.text)

# This will grab the name data
names = tree.xpath('//*[@id="yui_3_18_1_3_1434380225687_700"]')

# Print the name data
print('Names: ', names)

# Store the result list as a single row
names1.append(names)

# Print the number of rows collected
print(len(names1))

# Write it to csv
with open('testchem.csv', 'w', newline='') as b:
    a = csv.writer(b)
    a.writerows(names1)
print("The end")

Recommended answer

It is important to inspect the string returned by page.text rather than rely on the page source shown by your Chrome browser. Web sites can return different content depending on the User-Agent. In addition, GUI browsers such as Chrome may change the content by executing JavaScript, whereas requests.get does not.

If you write the contents to a file

import requests
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
# page.text is a str, so open the file in text mode
with open('/tmp/test', 'w', encoding='utf-8') as f:
    f.write(page.text)

and use a text editor to search for "yui_3_18_1_3_1434380225687_700", you'll find that there is no tag with that attribute value.
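The same check can be done in code instead of a text editor. The following is only a minimal sketch using the standard library: IdCollector and collect_ids are hypothetical helper names, and served_html is a shortened excerpt of the markup the server returned, standing in for page.text.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def collect_ids(html_text):
    """Return the set of id values present in html_text."""
    parser = IdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.ids


# Shortened excerpt of the served markup (assumption: stands in for page.text).
served_html = (
    '<div id="names"><h2>Names and Synonyms</h2><div class="ds">'
    '<h3>Name of Substance</h3><ul>'
    '<li id="ds2"><div>Acetaldehyde</div></li></ul></div></div>'
)

ids = collect_ids(served_html)
print(sorted(ids))
print("yui_3_18_1_3_1434380225687_700" in ids)
```

With the real page you would pass page.text to collect_ids. If the id you copied from Chrome is missing from the result, it was most likely generated in the browser by JavaScript.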

If you search for Name of Substance, you'll find

<div><br>Search for this InChIKey on the <a href="http://www.google.com/search?q=%22IKHGUXGNUITLKF-UHFFFAOYSA-N%22" target="new" rel="nofollow">Web</a></div></div><div id="names"><h2>Names and Synonyms</h2><div class="ds"><button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button><h3>Name of Substance</h3><ul>
<li id="ds2"><div>Acetaldehyde</div></li>

Therefore, you could use:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'


How this XPath was found:

Start with the <h3> tag:

In [215]: tree.xpath('//*[text()="Name of Substance"]')
Out[215]: [<Element h3 at 0x7f5a290e85d0>]

The <div> tag that we want is not a child of the <h3>. It is a descendant of the <h3>'s parent. Therefore, go up to the parent:

In [216]: tree.xpath('//*[text()="Name of Substance"]/..')
Out[216]: [<Element div at 0x7f5a290f02b8>]

and then use //div to search for all <div>s inside the parent:

In [217]: tree.xpath('//*[text()="Name of Substance"]/..//div')
Out[217]: 
[<Element div at 0x7f5a290e88e8>,
 <Element div at 0x7f5a290e8940>,
 ...]

The first div is the one we want:

In [218]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0]
Out[218]: <Element div at 0x7f5a290e88e8>

and we can extract the text using the text_content method:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
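Putting the steps together, here is a minimal offline sketch, not a definitive implementation. SAMPLE is an excerpt of the served markup wrapped in <html><body> so it can be parsed without a network request, and substance_name is a hypothetical helper that returns None when the heading is not found instead of raising an IndexError.

```python
from lxml import html

# Excerpt of the served markup (assumption: stands in for page.text).
SAMPLE = """<html><body>
<div id="names"><h2>Names and Synonyms</h2><div class="ds">
<h3>Name of Substance</h3><ul>
<li id="ds2"><div>Acetaldehyde</div></li>
</ul></div></div>
</body></html>"""


def substance_name(tree):
    """Return the text of the first <div> under the heading's parent, or None."""
    # Find the heading by its text, go up to its parent, then search for <div>s.
    divs = tree.xpath('//*[text()="Name of Substance"]/..//div')
    if not divs:
        return None
    return divs[0].text_content().strip()


tree = html.fromstring(SAMPLE)
print(substance_name(tree))

# A page without the heading yields None rather than an exception.
empty = html.fromstring("<html><body><p>nothing here</p></body></html>")
print(substance_name(empty))
```

Against the live site you would build the tree from page.text as in the question. Matching on the heading text avoids the auto-generated yui_... id, which is not present in the HTML that requests receives.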
