How do I determine the correct XPath?
Question
I am making a web scraper in Scrapy for wunderground.com, but many of the XPaths I choose return empty arrays. I found a different question on the same topic (here), which is actually why I switched my code to wunderground.com. However, the answer given there is directed at one specific object. How can I determine the correct XPaths for the other objects?
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
import time
from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        info_set = Selector(response).xpath('//div[@id="current"]')
        list = []
        for i in info_set:
            item = WundergroundScraperItem()
            # WORKS FINE
            item['temperature'] = i.xpath('div/div/div/div/span/span/text()').extract()
            item['temperature'] = item['temperature'][0]
            # EDITED XPATH FROM OTHER QUESTION
            item['humidity'] = i.xpath('.//td[dfn="Humidity"]/following-sibling::td//text()').extract()
            item['humidity'] = item['humidity'][2]
            # RETURNS EMPTY ARRAY
            item['chance_rain'] = i.xpath('div/div/div/div/a/strong/text()').extract()
            list.append(item)
        return list
Recommended answer
Generally, the answer to "how do I determine the correct XPath expression" is going to be either "by inspection" (that is, look at the document you're trying to query) or "through trial and error" (start with general expressions and then narrow them down until you get what you want).
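As a rough illustration of the trial-and-error approach, here is a minimal sketch that starts with a broad expression and narrows it step by step. It uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) rather than Scrapy, so that it runs without extra dependencies. The `SAMPLE` markup and the `texts` helper are made up for this example and are not the real wunderground.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; NOT the actual wunderground.com source.
SAMPLE = """<html><body>
<div id="current">
  <div class="now"><span class="temp">71</span></div>
  <table>
    <tr><td><dfn>Humidity</dfn></td><td><span>54</span>%</td></tr>
    <tr><td><dfn>Pressure</dfn></td><td><span>30.1</span>in</td></tr>
  </table>
</div>
<div id="footer"><span>About</span></div>
</body></html>"""

root = ET.fromstring(SAMPLE)


def texts(path):
    """Return the text of every element matched by `path` (illustrative helper)."""
    return [el.text for el in root.findall(path)]


# Step 1: very general, matches every <span> in the document.
step1 = texts(".//span")

# Step 2: restrict the search to the block we care about.
step2 = texts(".//div[@id='current']//span")

# Step 3: narrow down to the one element we actually want.
step3 = texts(".//div[@id='current']//span[@class='temp']")

for label, result in (("step1", step1), ("step2", step2), ("step3", step3)):
    print(label, result)
```

The same narrowing can be done interactively against the real response in `scrapy shell <url>`, where `response.xpath(...)` accepts full XPath 1.0 expressions.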
In this case, you've run into a very common problem: the page you see in your browser is partially rendered locally using JavaScript. The element that contains the chance of precipitation is included as part of a <script> resource, which from the perspective of your XML parser (a) is simply an opaque blob of text and (b) doesn't even contain the information you're looking for, because that needs to be filled in by the script first. The element isn't actually instantiated in the document until the page is rendered with JavaScript.
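To make the point about `<script>` content concrete, here is a small sketch, again using the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. The `PAGE` markup, including the `{{ chance_of_rain }}` placeholder, is invented for illustration and only assumes that the page ships its template inside a script element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source: the <strong> element exists only as text inside
# a <script> template, and its value is a placeholder the browser fills in.
PAGE = """<html><body>
<div id="current">
  <span class="temp">71</span>
  <script type="text/template"><![CDATA[
    <a><strong>{{ chance_of_rain }}</strong></a>
  ]]></script>
</div>
</body></html>"""

root = ET.fromstring(PAGE)
current = root.find(".//div[@id='current']")

# The parser never sees a <strong> element, so the query matches nothing.
rendered = current.findall(".//strong")

# The markup is there, but only as one opaque string without the real value.
script_text = current.find("script").text

print("strong elements found:", len(rendered))
print("'<strong>' appears in script text:", "<strong>" in script_text)
```

Comparing the raw source (for example via "View Source" in the browser, as opposed to the element inspector) with what the query returns is a quick way to tell whether an empty result comes from a wrong expression or from content that is only created at render time.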
It's not going to be possible to extract this data from the document source.