Convert the XPath obtained from a browser to a usable XPath for Scrapy


Question

This is a problem I always run into when getting a specific XPath from my browser.

Assume that I want to extract all the images from a website such as Google Image Search or Pinterest. When I use Inspect Element and then Copy XPath to get the XPath for an image, it gives me something like the following:

//*[@id="rg_s"]/div[13]/a/img

I got this from an image in a Google search. When I tried to use it in my spider, I used Selector and HtmlXPathSelector with the following XPaths, but none of them work!

//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img  # I changed this one based on the raw HTML of the page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath') 
.
.

I've read many questions, but I couldn't find a general answer on how to use XPaths obtained from a web browser in Scrapy.

Recommended answer

Blindly following the browser's suggestion for how to locate an element is usually neither safe nor reliable.

First of all, the XPath expressions that developer tools generate are usually absolute: they start from the parent of all parents, the html tag, which makes them more dependent on the page structure (though Firebug can also build expressions based on id attributes).
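To illustrate that dependence on page structure, here is a minimal sketch. The two page snippets are hypothetical, and Python's standard-library ElementTree (which supports only a subset of XPath) stands in for Scrapy's selectors so that the example runs without extra packages. In a spider, the same idea would apply to expressions passed to response.xpath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same image grid before and after the site
# inserts an extra <div> at the top of the container.
PAGE_V1 = """<html><body>
<div id="rg_s">
  <div><a><img src="a.jpg"/></a></div>
  <div><a><img src="b.jpg"/></a></div>
</div>
</body></html>"""

PAGE_V2 = """<html><body>
<div id="rg_s">
  <div class="banner">Ad</div>
  <div><a><img src="a.jpg"/></a></div>
  <div><a><img src="b.jpg"/></a></div>
</div>
</body></html>"""

# Browser-style expression: pins the image to an exact position in the tree.
BROWSER_STYLE = ".//div[@id='rg_s']/div[2]/a/img"
# Looser expression: any <img> anywhere below the container with this id.
ATTRIBUTE_ANCHORED = ".//div[@id='rg_s']//img"


def image_sources(page, xpath):
    """Return the src attribute of every element matched by xpath."""
    root = ET.fromstring(page)
    return [img.get("src") for img in root.findall(xpath)]


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, image_sources(page, BROWSER_STYLE),
          image_sources(page, ATTRIBUTE_ANCHORED))
```

The positional expression picks a different image once the extra div appears, while the expression anchored on the id keeps returning every image in the container.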

Also, the HTML you see in the browser can differ considerably from what Scrapy receives, because pages load asynchronously and JavaScript runs dynamically in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
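The following sketch shows why an expression copied from the browser can match nothing in the downloaded page. Both snippets are hypothetical: one imitates the initial HTML a non-browser client receives, the other the DOM after a script has filled the container. The loadImages call is a made-up name, and the standard-library ElementTree again stands in for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical initial HTML as downloaded: the container is still empty
# and a (made-up) script is expected to populate it in the browser.
RAW_HTML = """<html><body>
<div id="rg_s"></div>
<script>loadImages("rg_s")</script>
</body></html>"""

# Hypothetical DOM after the browser has run the script, which is what
# Inspect Element displays.
RENDERED_DOM = """<html><body>
<div id="rg_s">
  <div><a><img src="a.jpg"/></a></div>
</div>
<script>loadImages("rg_s")</script>
</body></html>"""

XPATH = ".//div[@id='rg_s']/div/a/img"


def count_matches(markup, xpath):
    """Count the elements that xpath selects in the given markup."""
    return len(ET.fromstring(markup).findall(xpath))


print("rendered DOM:", count_matches(RENDERED_DOM, XPATH))
print("raw HTML:", count_matches(RAW_HTML, XPATH))
```

The expression is valid in both cases; it simply has nothing to match in the markup that was actually downloaded.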

Rather than relying on the browser, inspect what Scrapy really has in the response: open the Scrapy shell, inspect the response, and debug your XPath expressions and CSS selectors there:

$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...


Here is what I got for the Google image search:

$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]: 
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
 u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
 u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
 u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
 u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
 u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
 ...
 u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
