Why does an XPath copied from Chrome not work?
Problem description
I am trying to scrape data from Web of Science.
Here is the specific page I am working with.
Below is the code I use to extract the abstract:
from lxml import etree  # etree is used below, so import it explicitly
import requests
url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)
And here is the XPath I got through "Copy XPath" in Chrome:
//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()
So I tried to get the abstract like this:
path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'
print(soup1.xpath(path))
However, I just got an empty list! Then I tried another way to test the XPath.
First, I saved the specific page as a local HTML file:
with open('1.html', 'w', encoding='utf-8') as f:
    f.write(d.text)  # the with block closes the file automatically
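Then I opened the file:
from requests_file import FileAdapter  # assumption: FileAdapter comes from the third-party requests-file package
s.mount('file://', FileAdapter())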
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')
And it gives me the abstract I want! Could anyone tell me why that happens?
Weirdly, when I repeat these steps with another page, saving it as a local file in the same way, it returns an empty list again!
I checked that the XPath given by Chrome is the same for these two pages.
So could anyone tell me what is wrong with my code and how to fix it?
Recommended answer
Full XPaths copied from the browser are usually brittle, because they depend on the exact position of every ancestor element. Prefer short, relative expressions based on attributes (such as id, class, etc.) or other identifying features, like contains(@href, 'image').
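As a rough illustration of that difference, the sketch below parses two small made-up HTML fragments and compares a positional path with an attribute-based one. The markup, the banner element, and the variable names are assumptions for the example; this is not the real Web of Science page.

```python
from lxml import html

# Two hypothetical versions of the same page: the second one has an extra
# banner <div> inserted before the content, as a live site might do.
PAGE_A = """
<html><body>
<form id="records_form">
  <div><div class="block-record-info"><p>Abstract text</p></div></div>
</form>
</body></html>
"""

PAGE_B = """
<html><body>
<form id="records_form">
  <div class="banner">Notice</div>
  <div><div class="block-record-info"><p>Abstract text</p></div></div>
</form>
</body></html>
"""

# Positional path in the style of a browser-copied XPath.
POSITIONAL = '//*[@id="records_form"]/div[1]/div/p/text()'
# Attribute-based path that ignores how deep the element is nested.
BY_ATTRIBUTE = '//div[@class="block-record-info"]/p/text()'


def extract(page, xpath):
    """Parse the HTML string and return whatever the XPath matches."""
    return html.fromstring(page).xpath(xpath)


positional_a = extract(PAGE_A, POSITIONAL)
positional_b = extract(PAGE_B, POSITIONAL)
attribute_a = extract(PAGE_A, BY_ATTRIBUTE)
attribute_b = extract(PAGE_B, BY_ATTRIBUTE)

print(positional_a, positional_b)
print(attribute_a, attribute_b)
```

In this sketch the positional path matches only the first fragment, since the extra div shifts the element positions, while the attribute-based path matches both.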
You could try a more specific XPath expression: (//div[@class="block-record-info"])[2]/p/text()
and rewrite your code like this:
import requests
from lxml import html
url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)
Output:
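If an XPath still returns an empty list, one way to narrow the problem down is to test the path one step at a time against the HTML that requests actually received, which may differ from what the browser shows. The helper below is a minimal sketch of that idea; the function name `first_failing_step` and the sample markup are hypothetical, chosen only for illustration.

```python
from lxml import html


def first_failing_step(tree, root, steps):
    """Return (longest matching path, first step that fails or None)."""
    if not tree.xpath(root):
        # Even the starting expression matches nothing.
        return None, root
    matched = root
    for step in steps:
        candidate = f"{matched}/{step}"
        if not tree.xpath(candidate):
            return matched, step
        matched = candidate
    return matched, None


# Hypothetical markup standing in for the downloaded page.
SAMPLE = """
<html><body>
<form id="records_form"><div><div><p>Abstract text</p></div></div></form>
</body></html>
"""

tree = html.fromstring(SAMPLE)
matched, failed = first_failing_step(
    tree, '//*[@id="records_form"]', ["div", "div", "div", "p"]
)
print(matched, failed)
```

Here the third div step is where the path stops matching, which points at the level of the document where the downloaded HTML and the copied XPath disagree.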