Why an XPath copied from Chrome does not work


Problem description

I am trying to scrape data from Web of Science.

And here is the specific page I am going to work with.

Below is the code I use to extract the abstract:

from lxml import etree
import requests

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)

And here is the XPath I got from Chrome's "Copy XPath" option:

//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()

So I tried to get the abstract like this:

path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'   
print(soup1.xpath(path))

However, I just got an empty list! Then I tried another way to test the XPath.

First, I saved the page as a local HTML file.

with open('1.html', 'w', encoding='UTF-8') as f:
    f.write(d.text)  # the with block closes the file on exit

Then I opened the file:

from requests_file import FileAdapter  # assumed source: the third-party requests-file package

s.mount('file://', FileAdapter())
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')

And it gives me the abstract I want! Could anyone tell me why that happens?

Weirdly, when I repeat these steps with another page saved as a local file, it returns an empty list again!

I checked that the xpath given by Chrome is the same for these two pages.

So could anyone tell me what's wrong with my code and how to fix it?

Solution

The full XPaths that browsers generate are usually unhelpful because they depend on the exact nesting of every element. Prefer relative expressions based on attributes (such as id or class) or on other identifying features, like contains(@href, 'image').
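A minimal sketch of that difference, assuming invented markup rather than the real Web of Science page: RAW, WRAPPED and extract are hypothetical names made up for illustration, and the extra wrapper div only stands in for any difference between the DOM the browser shows and the HTML that requests downloads. The positional path stops matching once the nesting changes, while the attribute-based path still finds the paragraph.

```python
from lxml import html

# Hypothetical markup, invented for illustration; not the real site.
RAW = """
<html><body>
<form id="records_form">
  <div class="block-record-info"><p>Title text</p></div>
  <div class="block-record-info"><p>Abstract text</p></div>
</form>
</body></html>
"""

# Same content with one extra wrapper <div> inside the form, standing in
# for a page whose nesting differs from what the browser displayed.
WRAPPED = (
    RAW.replace('<form id="records_form">',
                '<form id="records_form"><div class="wrapper">')
       .replace('</form>', '</div></form>')
)

# Positional path in the style of "Copy XPath": depends on exact nesting.
ABSOLUTE = '//*[@id="records_form"]/div[2]/p/text()'
# Attribute-based path: depends only on the class of the target block.
RELATIVE = '(//div[@class="block-record-info"])[2]/p/text()'


def extract(markup, xpath):
    """Parse an HTML string and return the list of XPath matches."""
    return html.fromstring(markup).xpath(xpath)


if __name__ == "__main__":
    for name, markup in (("RAW", RAW), ("WRAPPED", WRAPPED)):
        print(name, "absolute:", extract(markup, ABSOLUTE))
        print(name, "relative:", extract(markup, RELATIVE))
```

An empty list from the positional path on one variant, alongside a match from the attribute-based path, suggests that the structure changed rather than that the content is missing.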

You could try a more specific XPath expression, (//div[@class="block-record-info"])[2]/p/text(), and rewrite your code like this:

import requests
from lxml import html

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)

Output:
